Optimality of Graphlet Screening in High Dimensional 

Variable Selection 



Jiashun Jin^, Cun-Hui Zhang^ and Qi Zhang^ 
■■^Carnegie Mellon University, ^Rutgers University, ^ University of Pittsburgh 

March 3, 2013 
Abstract 

Consider a linear model Y = + az, where X has n rows and p columns and 
z ~ iV(0,/„). We assume both p and n are large, including the case of p » n. The 
unknown signal vector (3 is assumed to be sparse in the sense that only a small fraction 
of its components is nonzero. The goal is to identify such nonzero coordinates (i.e., 
variable selection). 

Wc are primarily interested in the regime where signals are both rare and weak 
so that successful variable selection is challenging but is still possible. Researches on 
rare and weak signals to date have been focused on the unstructured case, where the 
Gram matrix G = X'X is nearly orthogonal. In this paper, G is only assumed to be 
sparse in the sense that each row of G has relatively few large coordinates (diagonals 
of G are normalized to 1). The sparsity of G naturally induces the sparsity of the so- 
called graph of strong dependence (GOSD). The key insight is that there is an interesting 
interplay between the signal sparsity and graph sparsity: in a broad context, the signals 
decompose into many small-size components of GOSD that are disconnected to each 
other. 

We propose Graphlet Screening (GS) for variable selection. This is a two-step Screen 
and Clean procedure, where in the first step, we screen subgraphs of GOSD with se- 
quential x^-tests, and in the second step, we clean with penalized MLE. The main 
methodological innovation is to use GOSD to guide both the screening and cleaning 
processes. 

For any variable selection procedure /3, we measure its performance with the Ham- 
ming distance between the sign vectors of /3 and (3, and assess the optimality by the 
convergence rate of the Hamming distance. Compared with more stringent criteri- 
ons such as exact support recovery or oracle property, which demand strong signals, 
the Hamming distance criterion is more appropriate for weak signals since it naturally 
allows a small fraction of errors. 

We show that in a broad class of situations, Graphlet Screening achieves the optimal 
rate of convergence in terms of the Hamming distance. Well-known procedures such 
as the Z/°-penalization method and the L^-penalization methods do not utilize graph 
structure for variable selection, so they generally do not achieve the optimal rate of 
convergence, even in very simple settings and even when the tuning parameters are 
ideally set. 

Keywords: Asymptotic minimaxity, graph of strong dependence (GOSD), 
graph of least favorables (GOLF), Hamming distance, Graphlet Screening, phase 
diagram. Rare and Weak signal model, Screen and Clean, sparsity. 

AMS 2000 subject classifications: Primary 62G05, 62G10; secondary 62G20. 



Acknowledgments: We thank Pengsheng Ji and Tracy Ke for help and pointers. JJ 
and QZ were partially supported by NSF CAREER award DMS-0908613. QZ was also 
partially supported by the NIH Grant P50-MH080215-02. CHZ was partially supported 
by the NSF Grants DMS-0906420, DMS-1106753 and NSA Grant H98230-1 1-1-0205. The 
research was supported in part by Computational Resources on PittGrid. 

1 Introduction 

Consider a linear regression model 

Y = Xp + az, zr^N{0,In), (1.1) 

where the design matrix X = Xn,p has n rows and p columns. Throughout this paper, we 
assume the diagonals of the Gram matrix 

G = X'X 

are normalized to 1 (and approximately 1 in the random design model). Motivated by 
the recent trend of 'Big Data' where massive datasets consisting of millions or billions of 
observations and variables are mined for associations and patterns (e.g. genomics, com- 
pressive sensing), we are primarily interested in the case where both p and n are large with 
p > n (though this should not be taken as a restriction). The signal vector (3 is unknown 
to us, but is presumably sparse in the sense that only a small proportion of its coordinates 
is nonzero. The main interest of this paper is to identify such nonzero coordinates (i.e., 
variable selection). 

1.1 The paradigm of rare and weak signals 

We are primarily interested in the regime where the signals are both rare and weak: Whether 
we are talking about clickstreams in web browsing or genome scans or tick-by-tick financial 
data, most of what we see is noise; the signals, mostly very subtle, are hard to find, and 
it's easy to be fooled. 

While rarity (or sparsity) of the signal is a well-accepted concept in high dimensional 
data analysis, the weakness of the signal is a much neglected notion. Many contemporary 
studies of variable selection have focused on rare and strong signals, where the so-called 
oracle property or probability of exact support recovery are used as the measure of optimal- 
ity. Typically, these works assume the signals are sufficiently strong, so that the variable 
selection problem does not involve the subtle tradeoff between signal sparsity and signal 
strength. However, such a tradeoff is of great interest from both scientific and practical 
perspectives. 

In this paper, we focus on the regime where the signals are so rare and weak that they are 
barely separable from the noise. We are interested in the exact demarcation that separates 
the region of impossibility from the region of possibility. In the region of impossibility, the 
signal is so rare and weak that successful variable selection is impossible. In the region of 
possibility, the signals are strong enough so that successful variable selection is possible, 
in the sense of committing a much smaller number of selection errors than the number of 
signals. This is a very delicate situation, where it is of major interest to develop methods 
that yields successful variable selection. 
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When signals are rare and weak, exact recovery is usually impossible, and oracle prop- 
erty or probability of exact support recovery is no longer an appropriate criterion for as- 
sessing optimality. In this paper, we use the minimax Hamming distance as a measure of 
optimality. Hamming distance is the expected number of components for which the esti- 
mated signs and true signs of the regression coefficients disagree. Our primary goal is to 
study the rate of minimax Hamming distance and to develop procedures that achieve the 
minimax rate. 



1.2 Exploiting the sparsity of the graph of strong dependence 

Most of the work to date on "rare and weak" effects consider the completely unstructured 
case where no two features interact with each other in a significant way IT^ 1^51 . 
However, in numerous applications, there are relationships between predictors which are 
important to consider. 

In this paper, we are primarily interested in the class of linear models where the Gram 
matrix G is 'sparse', in the sense that each row of G only has relatively few large coordinates. 
Linear models where G are sparse can be found in the following application areas. 

• Compressive sensing. We are interested in a very high dimensional sparse vector ^. 
The plan is to store or transmit n linear functionals of /3 and then reconstruct it. For 
1 < i < n, we choose a p-dimensional coefficient vector Xj and observe Yi = X'-f3 + Zi 
with an error Zj. The so-called Gaussian design is often considered [HlIiniE], where 

Xi *~ A^(0,i7/n) with a sparse covariance matrix U. In this example, the sparsity of 
induces that of G = X'X. 

• Genetic Regulatory Network (GRN). For 1 < i < n, Wj = (Wj(l), . . . , Wj(|?))' rep- 
resents the expression level of p different genes corresponding to the i-th patient. 

Approximately, Wi '~ N{a,T?), where the contrast mean vector a is sparse reflect- 
ing that only few genes are differentially expressed between a normal patient and a 
diseased one |30j . Frequently, the concentration matrix Q. = is believed to be 
sparse, and can be effectively estimated in some cases (e.g. ll|), or can be as- 
sumed as known in others, with the so-called "data about data" available [26J. To 
estimate a, one may consider Y = n~^/^ S"=i ~ N{^/na, S) and use brute-force 
thresholding. However, such an approach is inefficient as it neglects the correlation 
structure. Alternatively, let be a positive-definite estimate of fi, the problem can 
be re-formulated as the following linear model: {d)^/'^Y i}^/'^Y ~ N{Q^^'^j3, Ip), 
where (3 = \fna and G ~ il, and both are sparse. 



Other examples can be found in Computer Security [25] and Factor Analysis |21j . 

Well-known approaches to variable selection include subset selection, the lasso, SCAD, 
MC+, greedy search and more [IlElIIllIIllEIlEaESlEHlEHlllQl. While these approaches 
may exploit the signal sparsity effectively, they are not designed to take advantage of the 
sparsity of the graphical structure of the design variables. It is therefore of great interest to 
study how to exploit such graph sparsity to substantially improve variable selection. This 
is particularly important in the "rare and weak" paradigm, where it is so easy to be fooled 
by noise. 

In fact, in such a paradigm, even the 'optimal' penalized least squares methods (in- 
cluding exhaustive subset selection) are non-optimal. The exhaustive subset selection is 
non-optimal because it is a one-stage and non-adaptive method that does not fully utilize 



the graphical structure among the design variables. See Section 1.9 for detailed discussion 
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In this paper, we propose a new approach to variable selection which we call the Graphlet 
Screening (GS). This is a two-stage Screen and Clean method, where the main method- 
ological innovation is the use of a graph of strong dependence (GOSD), constructed from 
the Gram matrix, to guide both the screening and the cleaning processes. The procedure 
limits the attention to strong correlated substructures only, and has a two- fold advantage: 
modest computational cost and theoretic optimality. Below, we begin our discussion with 
the so-called phenomenon of interaction of signal sparsity and graph sparsity, which plays 
a central role in the proposed method. 



1.3 Sparse signal model, interplay of signal sparsity and graph sparsity 

Motivated by the above examples, we adopt a sparse signal model as follows (e.g., [5j). Fix 
parameters e G (0, 1) and r > 0. Let b = {bi, . . . , bp)' be the p x 1 random vector where 

6i ~ Bernoulh(e). (1.2) 

We model the signal vector /3 as 

/3 = 6o/i, (1.3) 

where o denotes the Hadamard product (i.e., for any pxl vectors x and y, xoy is the p x 1 
vector such that (x o y)^ = Xiyi, 1 < i < p), and // € 0p(t) with 

Spir) = {fi£RP : >T,l<i<p}. (1.4) 



In later sections, we may further restrict fi to a subset of 0p(r); see (1.11). 



Definition 1.1 We call { 1.2)-{ I.4) the Rare and Weak signal model fiiW{e,T, fj,)). 

In this model, /3j is either or a signal with at least strength r. The parameter e is 
unknown to us, but is presumably small so the signals are sparse. At the same time, we 
take T to be moderately large (see Section 1.5 for details) so that the signals are barely 
separable from the noise. This models a situation where the signals are both rare and weak. 

Naturally, a sparse Gram matrix induces a sparse graph among design vectors, which 
we call the graph of strong dependence (GOSD). Towards this end, write 

X = [xuX2,...,Xp] = [Xi,X2,...,X„]' (1.5) 

so that Xj is the j-th column of X and X'- is the i-th row of X. For a tuning parameter 
6 > {6 = 1/logp or other small values of logarithmic order), we introduce 

n* = {n*ii,j))p^p, n*{i,j) = G{i,j)i{\Gii,j)\ > 5}, (i.e) 

3jS Si regularized Gram matrix. 

Definition 1.2 The GOSD is the graph Q* = iy, E), where V = {1, 2, . . . ,p} and nodes i 
and j are connected if and only if 0,*{i,j) ^ 0. 

If each row of ft* has no more than K nonzeros, then the graph Q* is K-sparse. 

Definition 1.3 A graph Q = {V^E) is K-sparse if the degree of each node is no greater 
than K. 
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At first glance, it is unclear how the sparsity of Q* may help in variable selection. In 
fact, for any fixed node i, even when K is as small as 2, it is possible to have a very long 
path that connects node i to another node j. Therefore, it is unclear how to remove the 
influence of other nodes when we attempt to make inference about node i. 

However, on a second thought, we note that what is crucial to variable selection is 
not the graph Q* , but the subgraph of G* formed by all the signal nodes. Compared to 
the whole graph, this subgraph not only has a much smaller size, but also has a much 
simpler structure: It decomposes into many components, each of which is small in size, and 
different components are disconnected (a component is a maximal connected subgraph). 
The following notation is frequently used in this paper. 

Definition 1.4 Fixing a graph Q, we say Tq <]Q if Tq is a component of Q . 

In other words, due to the interplay between the signal sparsity and the graph sparsity, 
the original regression problem is decomposable: the signals live in isolated units, each is 
small in size (if only we know where they are!), and different units are disconnected to each 
other. So to solve the original regression problem, it is sufficient to solve many small-size 
regression problems parallely, where one problem has little influence over the others. 

Formally, denote the support of the signal vector by 

S = S{f3) = {l<i<p: Pj^O}. 

Let be the subgraph of G* formed by all the nodes in S. The following lemma is proved 
in Section m 

Lemma 1.1 Fixing K > 1, m > 1, e > 0, t > 0, suppose Q* is K-sparse and (3 is from the 
Rare and Weak model RW{e,T, iJ,). Then, with at least probability 1 — p{eeK)"^~^^ , Qg de- 
composes into many components, each has a size < m, and different ones are disconnected. 

For moderately sparse signals (e.g. in an asymptotic framework where as p — )• oo, e = < 
for some fixed parameter > 0), p{eeK)^^^ is small so that the decomposability in 
Lemma [l . 1 1 holds with overwhelming probability. We mention that Lemma 1.1 is not tied to 



Model (1.2)-(1.4) and holds in much broader settings. For example, a similar claim can be 
drawn if the vector 6 in /3 = 6o^ satisfies a certain Ising model |24|. The decomposability 
of Q*g is mainly due to the interplay of the signal sparsity and the graph sparsity, not 
the specific model of the signals. For further elaboration on this point, see the proof of 
Lemma ll. II in Section [5l 



1.4 Graphlet screening 

The aforementioned decomposability invites the following two-stage variable selection pro- 
cedure, which we call the Graphlet Screening (GS). Conceptually, the procedure contains 
a graphical screening step (G5-step) and a graphical cleaning step (GC-step). 

• GS-step. This is an m-stage x^-screening process, where m > 1 is a preselected 
integer. In this process, we investigate all connected subgraphs of Q* of no more than 
m nodes. For each of them, we test whether some of the nodes in the connected 
subgraph are signals, or none of them is a signal. We then retain all those which we 
believe to contain one or more signals. 

• GC-step. The surviving nodes decompose into many components, each of which has 
no more than Iq nodes, where io is a fixed small number. We then fit each component 
with penalized MLE, in hopes of removing all falsely kept signals. 
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In philosophy, the GS is similar to [341 [T5] in that they have a screening and a cleaning 
stage, but is more sophisticated in nature. 



We now describe two steps in details. Recalling (1.5), we have the following definition. 



Definition 1.5 For X in Model (1.1) and any subsetl C {l,2,...,p}, let = P^{X) he 



the projection from to the span of {xj,j S X}. 

Consider the GS'-step first. Let Q* be as in (1.6) and fix m > 1. The m-stage x^- 
screening is as follows. 

• Initial sub-step. Let U* = 0. List all connected subgraphs of G*, say Iq, in ascending 
order of the number of nodes \Io\, with ties broken lexicographically, subject to \Io\ < 
m. Since a node is thought of as connected to itself, the first p connected subgraphs 
on the list are simply the nodes 1,2, ... ,p. We screen all connected subgraphs in the 
order they are listed. 

• Updating sub-step. Let Iq be the connected subgraph under consideration, and let U* 
be the current set of retained indices. We update U* with a tests as follows. Let 
F = Iq nU* and D = Iq\U*, so that F is the set of nodes in Iq that have already 
been accepted, and D is the set of nodes in Iq that is currently under investigation. 
Note that no action is needed if = 0. For a threshold t{D, F) > to be determined, 
we update U* by adding all nodes in D to it if 

T(Y, D, F) = llP^oy f - \\P^Yf > t{D, F), (1.7) 

and we keep U* the same otherwise (by default, ||P^y|| = if F = 0). We continue 
this process until we finish screening all connected subgraphs on the list. 

In the GS-step, once a node is kept in any sub-stage of the screening process, it remains 
there until the end of the GS-step (however, it may be killed in the GC-step). This has a 
similar flavor to that of the Forward regression. See Table 1 for a recap of the procedure. 
The GS-step uses the following set of tuning parameters: 



Q = {t{D, F) : {D, F) are as defined in (|l.7|)}. 
A convenient way to set these parameters is to let t{D,F) = 2(7^(7 log p for a fixed g > 







and all {D,F). More sophisticated choices are given in Section 

The computational cost of the GS-step hinges on the sparsity of Q* . In Section 1.5 
we show that with a properly chosen 5, for a wide class of design matrices, Q* is -R'-sparse 
for some K = Kp < (log(p))° as p — )• oo, where a > is a constant. As a result, the 
computational cost of the GS-step is moderate, because for any /C-sparse graph, there are 
at most p{eK)"^ subgraphs with size m [19j. 

The GS-step has two important properties: Sure Screening and Separable After Screen- 
ing (SAS). With tuning parameters Q properly set, the Sure Screening property says that 
Up retains all but a negligible fraction of the signals. Viewing U* as a subgraph of G* , the 
SAS property says that this subgraph decomposes into many disconnected components, 
each has a size < io for a fixed small integer Iq. Together, these two properties enable us to 
reduce the original large-scale regression problem to many small-size regression problems 
that can be solved parallelly in the GG-step. See Section [2] for elaboration on these ideas. 

We now discuss the GG-step. The following notation is frequently used in this paper. 
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Table 1: Graphet Screening Algorithm 



G5-step: List ^*-connected submodels Xo,fc with |2o,i| < |2^o,2| < 



< m 



Initialization: U* 



and k = 1 



Test Hq : Xo,fc nW* against Hi : Xo,fc with x test (1.7) 



Update: U* 



if f^o rejected, k ■(^ k + 1 



GC-step: As a subgraph of Q*, U* decomposes into many components Xq 



Use the L^'-penalized test (1.8) to select a subset Xq of each Xq 



Return the union of Xn as the selected model 



Definition 1.6 For apxm matrix X and subsets Xq G {1,2, . . . ,p} and Jq C {1,2,..., m}, 
X^o.Jo denotes the |Xo| x |Jo| sub-matrix of X formed by restricting the rows of X to Xq 
and columns to Jq. For short, in the case where Jq = {1,2, . . . ,m}, we write it as 

X^o 

and in the case where Xq = {1, 2, . . . ,p}, we write it as X*"^° . When m = I, X is a vector, 
and X-^o is the sub-vector of X formed by restricting the rows of X to Xq. 

For any 1 < j < p, we have either j ^U*, or that there is a unique connected subgraph 
Xq such that j G Xq<\U* (see Definition 1.3 for the notation). In the first case, we estimate 
Pj as 0. In the second case, for two tuning parameters u^^ > and v^^ > 0, we estimate 
the whole set of var iables /3^o by minimizing the following functional: 

\\pxo^Y-x*'^^>of + {un'm\o. (1.8) 

Here, ^ is an \Xq\ x 1 vector each nonzero coordinate of which > v^^ in magnitude, and ||i^||o 
is the L^'-norm of ^. The resultant estimator is the final estimate of Graphlet Screening 
which we denote by = (39s (y-, 5, Q, u3',v9',X,p, n). 

The computational cost of the GC-step hinges on maximal size of the components oilA*. 



By the SAS property of the G5-step (see Lemma 2.3 for details), for a broad class of design 
matrices, with the tuning parameters chosen properly, there is a fixed integer Iq such that 
with overwhelming probability, \Xq\ < Iq for any Xq <\U*. As a result, the computational 
cost of the GG-step is no greater than \ly(* \ x 2^'^, which is moderate. 

How does Graphlet Screening behave? Surprisingly well. In sections below, we show 
that Graphlet Screening achieves the minimax Hamming errors over a wide class of Rare 
and Weak models. Towards this end, we invoke the random design model. The use of 
random design model is mainly for simplicity in presentation. In particular, it is much 
easier to elaborate the sparsity of the Gram matrix in a random design model than a fixed 
design model. On the other hand, the main results in this section can be translated to fixed 
design models with careful modification of the notation. See for example Corollary |1.1| and 
Section H 

1.5 Asymptotic Rare and Weak model for regression with random design 

We continue our discussion with the Rare and Weak model RW(e, r, fj.) by introducing an 
asymptotic framework. In this framework, we let p be the driving asymptotic parameter, 
and parameters (e, r) are tied to p through some fixed parameters. In detail, fixing < 
i9 < 1, we model 

6 = 6^=^-". (1.9) 
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For any fixed i?, the signals become increasingly sparser as p — t- oo. Also, as •& ranges, the 
sparsity level ranges from very dense to very sparse, and covers most interesting cases. 

It turns out that the most interesting range for r is r = Tp = 0{y/\og{p)). In fact, 
when Tp <^ (Ty^log(p), the signals are simply too rare and weak so that successful variable 
selection is impossible. On the other hand, when Tp is sufficiently large, it is possible to 
exactly recover the support of /3 under proper conditions on the design. In light of this, we 
fix r > and calibrate r by 

T = Tp = a^/2r\og{p). (1.10) 

At the same time, fixing a constant a > 1, in the RW{e,T, ^i), we further restrict the 
vector /i to a subset of 0p(rp), denoted by 6p(Tp, a), where 

Q*p{Tp,a) = {/U G 0p(Tp) : < aTp,i = 1,2, . . . (1.11) 

and the parameter a is unknown. The constraint of < aTp is mainly for technical reasons 



(only needed in the proof of Lemma 2.3). Hopefully, in the near future, such a constraint 



can be removed. See later part of this subsection for more discussion on the role of a. 



Definition 1.7 We call model (1.2)-(1.4) and ) the Asymptotic Rare Weak sig- 

nal model ARW(t?, r, a, //). 

We now introduce the random design model. Fix a correlation matrix Vt that is pre- 
sumably unknown to us (however, for simplicity, we assume that 17 has unit diagonals). In 
the random design model, we assume that the rows of X as iid samples from a p-variate 
zero means Gaussian vector with correlation matrix 

~ A^(0, -S7). (1.12) 

n 

The factor 1/n is chosen so that the diagonal elements of the Gram matrix G are approx- 
imately one. In the literature, this is called the Gaussian design, which can be found in 
Compressive Sensing [2], Computer Security [8], and other application areas. 
At the same time, fixing k G (0, 1), we model the sample size n by 

n = np=p'^. (1.13) 

As p — )• (X), Hp becomes increasingly large but is still much smaller than p. We assume 

K>(1-1?), (1.14) 



so that Up S> pep. Note pep is approximately the total number of signals. Condition (1.14) 
is almost necessary for successful variable selection [9l [10] . 



Definition 1.8 We call Model (1.12)-(1.14-) the Random Design model RD(i?, k, il). 



Come back to (1.11). From a practical point of view, it is preferable to assume a 
moderately large (but fixed) a, since we usually don't have sufficient knowledge on ^u. For 
this reason, we are primarily interested in the case where a is "appropriately" large. This 
will make Qp{Tp,a) sufficiently broad so that neither the minimax rate nor any variable 
selection procedure needs to adapt to a. 

Towards this end, we impose some mild "local" regularity conditions on 17. In detail, 
for any positive definite matrix A, let \{A) be the smallest eigenvalue, and let 

A|.(ll) = min{A(^) : A\s a, k x k principle submatrix of (1-15) 
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At the same time, fixing a constant cq > 0, let ('!9,'r) be as in (1.9) and (1.10), respectively, 
let m be as in the GS-step, and let g be the smallest integer such that 

g >max{m, (T9 + r)V(2r)}. (1.16) 

Introduce 

Aip{co,g) = {n : p X p correlation matrix, A^,(r2) > cq, 1 < A; < g}. 
For any two subsets Vq and Vi of {1, 2, . . . consider the optimization problem 

{0?\Vo,Vi),ei'\Vo,Vi)) = argmax{(0« - - ^W)}, (1.17) 

subject to the constraints that for k = 0,1, O^'^^ are p x 1 vectors satisfying \6^''^\ > 1 

for i G Vfc and 9^''^ = otherwise, and that the sign vectors of 6*^*^^ and O'^^^ are unequal. 
Introduce 

a*Jn)= max max{||^f (Fq, V^i)||oo, ||#(Vb, Fi)||oo}. 

{(yo,vi):|youVi|<s} 

We have the following lemma, the proof of which is elementary and thus omitted. 

Lemma 1.2 For any il. € Mp{co, g) , there is a constant C = C{cQ,g) such that a* {Q) < C. 

In this paper, unless stated otherwise, we assume 

neMp{co,g), a>a*g{n). (1.18) 

One exception is Theorem |1.1[ where the result holds without such a constraint. In Section 
we further restrict to a subset of M.p{co,g) to foster graph sparsity. Condition 
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(1.18) is mild for it involves only small-size principle sub-matrices of Q, and we assume 
a > a*{^}) mostly for simplicity. For insight, imagine that in (1.17), we further require that 

l^i*^^! < o,Tp, i G Vfc, = 0, 1. Then as long as a > a*{Q), the optimization problem in 



(1.17) has exactly the same solution, which does not depend on o. This explains (1.18). 

For any fixed /3 and any variable selection procedure /3, we measure the performance by 
the Hamming distance between the sign vectors sgn(/3) and sgn(/3): 



p 

hp0,l3\X) = E[Y^l{sgn0,)^sgn{f3j))\X 



In the Asymptotic Rare Weak model, /3 = 6 o ^, and (ep, Tp) depend on p through (t?, r), so 
the overall Hamming distance for /3 is 

Hp0; ep, Up, /i, Q) = E.^En [hp0, /3|X)] = E.^En [/ip(/3, b o /x|X)] , 

where E^^ is the expectation with respect to the law of b, and Eq is the expectation with 
respect to the law of X; see ( |1.2[ ) and ( |1.12 ). Finally, the minimax Hamming distance is 

Hamm*('!9, K, r, a, fi) = inf sup {-f/p(/3; e^. Tip, ^u, il)}. 

In the above definitions, sgn(a;) = 0, 1, —1 if j; = 0, a; > 0, and x < correspondingly. Note 
that the Hamming distance is no smaller than the sum of the expected number of signal 
components that are misclassified as noise and the expected number of noise components 
that are misclassified as signal. 
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1.6 Lower bound for the minimax Hamming distance, and GOLF 

In this section, we construct a lower bound for the minimax Hamming distance. We focus 
our discussion on the Random Design model, but will also address deterministic design. 
The main insight is that, the lower bound depends on the structure of small-size connected 
subgraphs of Q*, rather than the large-scale structures of G*. This is a direct result of the 
decomposability of Qg mentioned earlier. 

It is noteworthy that the constructed lower bound is useful provided that the so-called 
graph of least favorables (GOLF) (to be introduced below) is sparse. The GOLF can be 
sparse even if Q* is heavily non-sparse, so the lower bound can be useful in rather general 
context, with only minimal conditions imposed on Q. 



Below, we first derive such a lower bound without assuming (1.18). We then discuss the 



case where (1.18) holds, and derive a useful alternative expression for the lower bound. The 
key in the construction is to derive lower bounds for the so-called "local risk" . The "local 
risk" at an index j, 1 < j < is the risk of estimating the set of variables {/3fc : d{k, j) < g}, 



where g is defined in (1.16) and d{j,k) denotes the geodesic distance between j and k in 
the graph Q* . Aggregating the lower bounds for the "local risk" at different j gives a lower 
bound for the global risk, where the effect of repetitive counting is controlled by the sparsity 
of the GOLF. 

We now construct a lower bound for the "local risk" at site j. The "local risk" is 
characterized by the exponent p^{'&, r, a, Q), which depends on in a complicated way, and 
it takes relatively long preparations to describe it. 

In detail, for a lower bound for the "local risk" at site j, the goal is to construct two 
subsets Vq and Vi (maybe equal) of {1, 2, . . . ,p} and two realization of (3, (3^'^^ and I3^^\ 
such that their sign vectors are unequal, and 

• (a) j £ ViUVo, 

• (b) /3(0) and take same values at any k ^ VqU Vi, 

• (c) for any keVoU Vi, fSj^^ / if and only [fk£Vi,i = 0, 1. 

In the literature, it is known that how well we can estimate {/3fc : d{k,j) < g} depends on 
how well we can test the model Y = + az against the model Y = X/]^^^ + az, where 

z ~ N{0,ln)- These two models can be thought of as an original model and a tampered 
one, respectively. In the context of lower bound construction, we usually assume that 
and are known, but we don't know which of the two models is true. The least favorable 
scenario corresponds to the "worst-case" configuration of {Vq,Vi, f3^^\ (3^^'^), for which two 
regression models are the most difficult to separate. 

Now, for any subset V C {1, 2, . . . ,p}, let ly be the p x 1 vector such that {Iv)k = 1 if 
k €z V and otherwise, and let By be the set of vectors 

Bv = {d = Ivofi: fie e;(rp, a)}. 

Rewriting /3(i) - = 9^^^ - 6'(°) such that 6'(*) eBv,,i = 0, 1, we introduce 

a(0(o), ^(D) = r-2(0(o) - 0(i))'17(^(o) - 0(1)). 

In the aforementioned testing problem, it is seen that the optimal test is to reject the 
original model if and only if {d^ _ 0Wyx'{Y - X/3(0)) > taTpy^oJW^^W)) for some 
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threshold t > to be determined, and the sum of Type I and Type II error of any test is 
no smaller than (up to some negligible differences) 

inf [ej,^o I i (t ) + e|,^i I ^> (t - (Tp/a) [a(0(°\ )] ] . (1.19) 

This gives a lower bound for the "local risk" at index j. Here ^> = 1 — $ is the survival 
function of N{0, 1). In the above calculations, we have used that in the Random Design 
model, 

(^(0) _ 0(i))'G(e(o) - 0(1)) « (^W - 0(i))'f)(eW - 0(1)), 

since the support of 0(^) — 9^^^ is contained in Vq U Vi, which is a small-size set. 

To tighten the lower bound in (1.19), we look for the "worst-case" configuration of 
(Vb, Fi, ^C^), 0(1)). Towards this end, first, we fix (Vo,Vi) and look for the "worst-case" 
(0(0), 0(1)). To do so, define a*{Vo,Vi;n) = a*{Vo,Vi;^,r,a,n,p) by 

a*(yo,^i;f^) =min{a(0(o),0(i)) : 0(^) = Iv.o;x(^), ^(^) g e;(rp, a), i = 0, 1, sgn(0(o)) / sgn(0(i 



where sgn(0) denotes the sign vector of 9. By (1.19) and monotonicity, a lower bound for 
the "local risk" at index j is then 

sup {inf[el^ol^(i) + 6l^^l$(t-(rp/a)K(l^o,^i;^^)]'/')}. (1-20) 



Next, we find the "locally worst-case" (Vb,Vi) for (1.20). The following shorthand 
notation is frequently used in this paper, which stands for a generic multi-log(p) term that 
may vary from one occurrence to another. 

Definition 1.9 Lp > denotes a multi-\og{p) term such that when p — )• oo, for any 5 > 0, 

LpP^ — 5- oo and Lpp^^ — )• 0. 

Now, introduce r]{Vo, Vi; J7) = r]{Vo, Vi; i?, r, a, ^,p) by 

1 



rj{Vo,Vi;n)=nmx{\Vo\,\Vi\}'&+ ^ 



Va*(Vb, Vi;iZ)r 



Recalling Tp/cr = \J2r logp and tp = p = exp(— -iJlogp), by Mills's ratio the lower bound 
(1.20) can be equivalently written as 



Lpcxp - min r](Vo,Vi;n)logp 
In light of this, for any 1 < j < p, the least favorable configuration at index j is the triplet 

p*{^,r,a,n)= min 7]{Vo,Vi;n), (1.21) 
■' {(Vo,Vi):jeViuyo} 

and 

(Kj^^ij) = ^^s^^^{{Vo,viy.j&viuvo}^(yo, Vi; 

When there is a tie, pick the pair that appears first lexicographically. Therefore, for any 
1 < j < Vqj U V^j is uniquely defined. The following lemma is proved in Section j5| 

Lemma 1.3 Fixp, -d, r, n, and 1 < j < p, max{|Vo)- U V*j\} < (?? + r)V(2i?r). 



11 



So far, we have been focused on the "worst-case" of the "local risk", which is charac- 
terized by Lpexp(— r, a, fi) log(p)). In order for the sum of "local risk" to reflect the 
global risk, we need the sparsity of the following graph. 

Definition 1.10 The graph of least favorables (GOLF) is the graph = (y,E), where 
V = {1,2, . . . ,p} and nodes j and k are connected if and only if (Fq* U V^* ) and (Fq*^ U V^j^) 
have non-empty intersections. 

Since k e {Vqj^ U V*,^), j and k are connected ii k e {V^j U V^j). Let dp{g^) be the maximum 
degree of Q^. When dp{Q^) is relatively small, the least favorable configurations do not 
contain a hub: that is, any node k can appear in (Voj U V*^) for relatively few j. Such 
a property holds for a broad class of fi, including even those ill-posed ones. In this very 
general context, we have the following theorem, which establishes the lower bound. 

Theorem 1.1 As p ^ oo, Hamm;(??, k, r, a, f)) > [rfp(a'')] Ei=i P"''^*^'^'''''''^^ ^he 
Random Design model RD('d, k, 0). 

A similar conclusion can be drawn for deterministic design models, the proof of which is 
similar so we omit it. 

Corollary 1.1 For deterministic design models, the parallel lower bound holds for the min- 
imax Hamming distance with J7 replaced by G in the calculation of p*^ and dp{Q^). 



In Theorem 1.1, we have not imposed much conditions on f]. In the remaining part 
of this subsection, we assume (1.18) holds. In this case, pj(i9,r, a, 0) does not depend on 
a, and have an alternative expression. In detail, for any subsets D and F of {1, 2, . . . 
define oj{D, F; Q) = lo{D, F; r, a, n, p) by 

uj{D,F-^)= min {^'(O^'^ - 0^'^(J7^'^)-il7^'^)^ j, (1.22) 

and p{D,F;Vt) = p{D,F;^,r,a,n,p) by 

, {\D\ + 2\F\)'d ( l^{D,F;n)r, is even, 

p(D,f;n) = + 1 | + l[(v'^IWi)^-^5J^)J^ |o|.sodd. 



The following lemma is proved in Section [5j 



:i.23) 



Lemma 1.4 Fix 'd,r,g,a,Vt,CQ. If (1.18) holds, then p*{'d,r,a,H) does not depend on a, 
and satisfies 

p*(^,r,a,n)= min p(D,F;n). 

{{D,F):j(iDVjF,DnF=%,D^%,\DVjF\<g} 



Lemma 



1.4 



provides an alternative way to calculate /Oj(t?, r, a, ft), and is particularly useful 



in proving Corollaries 1.2p.4 



1.7 Upper bound and optimality of Graphlet Screening 

We are now ready for the main result of this paper. When il. is sparse — so that the GOSD Q* 
is sparse — the Hamming distance of Graphlet Screening achieves the rate of convergence 
prescribed by the lower bound we derived in the preceding section, provided the tuning 
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parameters are properly set. Therefore, the lower bound is tight and Graphlet Screening is 
rate optimal. 

In detail, fix constants 7 G (0, 1) and ^4 > 0. To foster sparsity of the Gram matrix, we 
shift our attention from M.p(cQ,g) (where only a lower bound cq for the lower eigenvalue of 
dimension g is imposed) to the following subset: 

p 

M;{-f,co,g,A) = {nG Mp{co,g) : ^|J7(i, j)^ < A 1 < ^ < p}- (1-24) 

i=i 

Note that any Q £ M*{'y, co,g, A) is sparse in the sense that each row of O has relatively 
few large coordinates. The sparsity of O implies the sparsity of the Gram matrix G, since 
small-size sub- matrices of G approximately equal to their counterparts of Q. 

In Graphlet Screening, when we regularize GOSD, we set the threshold 5 as l/log(p); 
see (1.6). Such a choice for threshold is mainly for convenience, and can be replaced by any 
term that tends to logarithmically fast as p — )• 00. 

We choose the tuning parameters in the GS'-step in a way such that 



t{D,F)=2a^q{D,F) logp, 
where q = q{D, F) > are chosen as follows: For fixed qo > 0, 



:i.25) 



go < < 

< y/q < ^/tJr ■ 



(i9+a;r)2 
1 



\D\ is odd & ojr/^ > \D\ + (|l)|2 - 1)V2^ 



\D\ is even & ur/ii) > 2\D\ 



and q can be any other number no smaller than qo in other cases. Here uj = u;{D, F; Q) are 
defined in the same way as (1.22). 



We set the GC-step tuning parameters by 



a 



\/2i?log 



P, 



cj\/27k) 



gp. 



[1.27) 



The main result in this paper is the following theorem. 



Theorem 1.2 Consider Model (1.1) where j3 is modeled by ARW(t9, r, a, /i) and X is mod- 
eled by RD(7?, K, Q). Suppose that for sufficiently largep, a > a*g{U) and Q G A1*(7, cq, g, A). 
Let {i^" = P^'^{Y;6, Q,u^^ ,v^^ , X,p,n) be the Graphlet Screening procedure defined in Sec- 
tion 1.3. If the tuning parameters are set as in (1.25)-(1.21), then as p ^ 00, 

+ 0(1). 



sup Hp{P'^^; ep, Up, p, 0) < Lp 
Mee;(rp,a) 



Note that p'j = p*j{'&, r, a, 17) does not depend on a. Also, note that in the most interesting 
range, '^^=iP~^^ 3> 1. So if we choose m properly large (e.g. (m + !)-& > 1), then 



sup Hp{$s';ep,np,p,n) < J^p-^.*^'^''^''^'^). 



Together with Theorems |1.1[ this says that Graphlet Screening achieves the optimal rate of 
convergence, adaptively to all in cq, g, A) and /3 E 0p(Tp, a). We call this property 
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optimal adaptivity. Note that since the diagonals of il. are scaled to 1 approximately, 
K = log(np)/log(p) does not have a major influence over the convergence rate, as long as 



(|1.14D holds. 

addresses the case where o > a*{Q,). We now briefly discuss the case 



Theorem 



1.2 



where a < a* (J 2). In this case, the set 0*(Tp,a) becomes sufficiently narrow and a starts 
to have some influence over the optimal rate of convergence, at least for some choices of 
(i?, r). To reflect the role of a, we modify Graphlet Screening as follows: (a) in the GC-step 



(1.8), limit ^ to the class where either = or Tp < < ar^, and (b) in the GS-step, 
replacing the x^-screening by the likelihood based screening procedure; that is, when we 
screen Iq = D L) F, we accept nodes in D only when h{F) > /i(Xo), where for any subset 
Dc{l,2,...,p}, 

h{D)= min ^||P^(y - + Mj')!^!- 

From a practical point of view, this modified procedure depends more on the underlying 
parameters and is harder to implement. However, this is the price we need to pay when a is 
small. Since we are primarily interested in the case of relatively larger a (so that a > a*{0,) 
holds), we skip further discussion along this line. 

1.8 Phase diagram and examples where p*('&,r, a,n) have simple forms 

In general, the exponents p*{'d,r,a,Q) may depend on 17 in a complicated way. Still, from 
time to time, one may want to find a simple expression for ("i?, r, a, 0). It turns out that in 
a wide class of situations, simple forms for p*(t?, r, a,0) are possible. The surprise is that, 
in many examples, p'j{'d,r,a,Q) depends more on the trade-off between the parameters 
and r (calibrating the signal sparsity and signal strength, respectively), rather than on the 
large coordinates of fi. 

We begin with the following theorem, which is proved in [25, Theorem 1.1]. 

Theorem 1.3 Fix -d G (0, 1), r > 0, and a > 1. Then for any correlation matrix 

Hamm*(i?, K, r, a, Jl) ^ / 1, < r < ■!?, 



Note that p^~'^ is approximately the number of signals. Therefore, when r < -d, the number 
of selection errors can not get substantially smaller than the number of signals. This is the 
most difficult case where no variable selection method can be successful. 

In this section, we focus on the case r > i?, so that successful variable selection is 



possible. In this case. Theorem 1.3 says that a universal lower bound for the Hamming 
distance is 

^^^l_(^+,)2/(4r)_ 

An interesting question is, to what extend, this lower bound is tight. 

Recall that A|.(r2) denotes the minimum of smallest eigenvalues across all kxk principle 
submatrices of fi, as defined in (1.15). The following corollaries are proved in Section [5j 



Corollary 1.2 Fix i9 G (0, 1) and r > such that 1 < r/i? < 3 + 2^2 5.828. Suppose 



that in addition to the conditions of Theorem 1.2. \Q,{i,j)\ < 4\/2 — 5 ~ 0.6569, for all 
l<i,j <P, i + 3, t/ienHamm;(i?,K,r,a,17) = Lppi-('^+^)'/(^^). 
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Corollary 1.3 Fix ■& G (0, 1) and r > such that 1 < r/i) < 5 + 2\/6 f« 9.898. Suppose 
that in addition to the conditions of Theorem 1.2, A3 > 2(5 — 2\/6) ~ 0.2021, A| > 5 — 
2\/6 w 0.1011, and for all 1 < i,j < p, i j, \^{i,j)\ < 8\/6 - 19 ^ 0.5959, then 
Hamm;(t?, k, r, a, J^) = Lppi-('^+'^)'/(4r) . 

The conditions in these corollaries are rather relaxed. Somewhat surprisingly, the off- 
diagonals of Q do not necessarily have a major influence on the optimal rate convergence, 
as one might have expected. 






Figure 1: Phase diagram for Vt = Ip (left), for Vt satisfying conditions of Corollary 1.2 



(middle), and for satisfying conditions of Corollary 1.3 (right). Red line: r = {}. Solid 
red curve: r = p(??, Q). In each of the last two panels, the blue line intersects with the red 
curve at (??,r) = (1/2, [3 + 2^/2]/2) (middle) and (i?,r) = (1/3, [5 + 2V6]/3) (right), which 
splits the red solid curve into two parts; the part to the left is illustrative for it depends on 
$7 in a complicated way; the part to the right, together with the dashed red curve, represent 
r = (1 + \/l — -i?)^ (in the left panel, this is illustrated by the red curve). 



Together, Theorem 1.3 and Corollaries 1.2 1.3 have an interesting implication on the so 



called phase diagram. Call the two-dimensional parameter space {("i?, r) : < "i? < 1, r > 0} 
the phase space. There are two curves r = 1} and r = /3('!?, ft) (the latter can be thought 
of as the solution of Y^P^^p~Pj^'^'^'"-'^^ = l; recall that p*j{'&, r, a, i7) does not depend on a) 
that partition the whole phase space into three different regions: 



Region of No Recovery. {(i?, r) : < r < < < 1}. In this region, as p — t- 00, 
for any Q and any procedures, the minimax Hamming error equals approximately to 
the total expected number of signals. This is the most difficult region, in which no 
procedure can be successful in the minimax sense. 

Region of Almost Full Recovery. {(??, r) : < r < p{^,0,)}. In this region, as p — t- 00, 
the minimax Hamming distance satisfies 1 <C Hamm*(??, k, r, a, 0) <C p^~^ , and it is 
possible to recover most of the signals, but it is impossible to recover all of them. 

Region of Exact Recovery. In this region, as p — t- 00, the minimax Hamming distance 
Hamm* K,r, a, 0,) = o(l), and it is possible to exactly recover all signals with 
overwhelming probability. 

In general, the function /)(??, fi) depends on Q in a complicated way. However, by Theorem 
1.3 and Corollaries |1.2|fL3l we have the following conclusions. First, for all Q and a > 1, 

p{i9, n)>{l + Vl-'&f, for ah < 1? < 1. 
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Second, in the simplest case where O = Ip, Hamm*('!9, k, r, a, Q) = Lpp^ (i9+r)2/(4r)^ 



p{^,n) = {i + Vi^f, 

Third, under the conditions of Corollary |1.2[ 

p{i^,n) = {i + Vi^f, 

Last, under the conditions of Corollary |1.3[ 

p{i^,Q) = (1 + ^/1^)2, 



for all < 1? < 1. 



if 1/2 < < 1, 



if 1/3 <^< 1. 



The phase diagram for the last three cases are illustrated in Figure [T] The blue lines are 
r/^ = 3 + 2V2 (middle) and r/i? = 5 + 2\/6 (right). 

Corollaries [1.2 1.3 can be extended to more general situations, where rf-d may get 
arbitrary large, but consequently, we need stronger conditions on il.. Towards this end, we 
note that for any (■d, r) such that r > ??, we can find a unique integer = N{i!), r) such 
that 

2N -l<{'d/r + r/^) /2<2N + 1. 



Suppose that for any 2 < k < 2N — 1 , 

r (r/t? + ^/r)/2 -2j + 2 + y/[ir/'d + 'd/r)/2 - 2j + 2]2 - 1 



At(ri) > max 

{(A;+l)/2<i<min{fc,Ar}} 



and that for any 2 < k < 2N, 



max 

{fc/2<i<min{fc-l,Ar}} 



-hN}} i 



{2k~2j + l){r/-d) 



{r/{> + §/r)/2 + l- 2j 
{k-j){r/^) 



}■ 



:i.28) 
:i.29) 



Then we have the following corollary. 



Corollary 1.4 Fix ■& G (0,1) and r > such that r > Suppose 11.2^41.2^ hold, 



additional to the conditions of Theorem 1.2, then Hamm*('!9, r, a, i7) = Lpp^ (i7+rp/(4r)_ 



Corollary |1 .4| implies a similar partition of the phase diagram as do Corollaries 1.2 1.3, say, 
that ( 1.28 )-( 1.29) hold for all {'d,r) satisfying r/i) < sq, for some fixed constant sq > 0. 



1.9 Non-optimaility of subset selection and the lasso 

Subset selection (also called the L'^-penalization method) is a well-known method for vari- 
able selection, which selects variables by minimizing the following functional: 

^\\Y-Xpf+^^{Xssfm\o, (1.30) 

where \\P\\q denotes the L'?-norm, q > 0, and Ass > is a tuning parameter. The AIC, 
BIC, and RIC are methods of this type [il|3Tl[16]. Subset selection is believed to have good 
"theoretic property" , but the main drawback of this method is that it is computationally NP 
hard. To overcome the computational challenge, many relaxation methods are proposed, 
including but are not limited to the lasso [3 [32], SCAD [U], MC+ [36], and Dantzig selector 
[B]. Take the lasso for example. The method selects variables by minimizing the following 
functional: 

hY-Xl3f + Xiassom\i, (1.31) 
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where the L^-penahzation is replaced by the L^-penahzation, so the functional is convex 
and the optimization problem is solvable in polynomial time under proper conditions. 

Somewhat surprisingly, subset selection is generally rate non- optimal in terms of selec- 
tion errors. This sub-optimality of subset selection is due to its lack of flexibility in adapting 
to the "local" graphic structure of the design variables. Similarly, other global relaxation 
methods are sub-optimal as well, as the subset selection is the "idol" these methods try 
to mimic. To save space, we only discuss subset selection and the lasso, but a similar 
conclusion can be drawn for SCAD, MC+, and Dantzig selector. 

For mathematical simplicity, we illustrate the point with an idealized regression model 
where the Gram matrix G = X'X is diagonal block-wise and has the following form 

G{i,j) = l{i = j} + ho ■ - i| = 1, max{i,j) is even}, \ho\ < 1, l<i,j<p. (1.32) 

Using an idealized model is mostly for technical convenience, but the non-optimality of 
subset selection or the lasso holds much more broadly than what is considered here. Since 
our goal is to show such methods are non-optimal, using a simple model is sufficient: if a 
procedure is non-optimal in an idealized case, we can not expect it to be optimal in a more 
general context. 

At the same time, we continue to model /3 with the Asymptotic Rare and Weak model 
ARW('(?, r, a, /i), but where we relax the assumption oi fi £ @*{Tp,a) to that of /i G Qp{Tp) 
so that the strength of each signal > Tp (but there is no upper bound on the strength). 
Consider a variable selection procedure (3*, where * = gs, ss, lasso, representing Graphlet 
Screening, subset selection, and the lasso (where the tuning parameters for each method 
are ideally set; for the worst-case risk considered below, the ideal tuning parameters depend 
on (i?, r,p, ho) but do not depend on /j,). For some exponents = p*('i?, r, ho) that does not 
depend on p, it is seen that for large p, the worst-case Hamming selection error of /3* has 
the form of 

sup Hp0*■,ep,^^,G) = Lpp'~P*^''''''^^\ 
{^eepCrp)} 



Here, Hp is slightly different from that in Section 1.5 since the settings are slightly different. 



We now study /?^(i?, r, ho). Towards this end, we first introduce 

Pfllo(^^r,ho) = {{2\ho\)-Hil-hl)V^- ^J{l-hl){l-\ho\)'r-4\ho\{l-\hoM}\ 
and 

p!1o(^> r, ho) = [(1 + \ho\)Vr- ^ {1 - \ho\)^r - 4\ho\^/il - h^)]' . 

We then let 

2i9, r/t? < 2/(1 - hi) 



PssyP^r,no) ^ [2^+(i_/j2)^]2/[4(i_/^2y]^ > 2/(1 - /ig) ' 

p(2)(^r/.n) = l . r/^<2/(l-|/.o|) 

' ' °^ I 2y2{l-\ho\)r-^{l-\ho\)r-^]\ r > 2/ {I - \ho\) 



(1) , N_ J 2^' r/^<2/{l-\ho\) 



2 



pLLo(^'^'^o) = < (3) 



pLL(^'^M, r/i)>2/{l-\h 



01 



\2 ' 
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and 



.(4) 



The following theorem is proved in Section [5} 



r/^?<(l + |/io|)/(l-|/io|)=^ 
r/^>il + \ho\)/{l-\ho\f • 



Theorem 1.4 Fix £ (0, 1) and r > such that r > ■&. If G satisfies (1.32), then 



^ nnn((^, (1^. 2.. ^^^^ 7. ^^^^ ^ d-) 



4(1 



Pssi'&,r,ho) = min{ 



4r 



r,p«(7?,r,/io),pf;(^,r,/io)}, (1.34) 



and 



Pza.so(i?,r,/Jo) =min{^^^±:^, z?+ JJi;=-, p;iL(^?, r, /io), £so(^> ^o)}- 



4r 



2(1 + yr^) 



:i.35) 



It can be shown that pgs{i},r, ho) > /9ss(??, r, /iq) > Piasso{^,r, ho), where depending 
on the choices of (t?, r, /iq), we may have equality or strict inequality (note that a larger 
exponent means a better error rate). This fits well with our expectation, where as far as the 
convergence rate is concerned, Graphlet Screening is optimal for all (T?,r, /iq), so it beats 
the subset selection, which in turn beats the lasso. Table [2] summarizes the exponents for 
some representative I'd, r, ho). It is seen that differences between these exponents become 
increasingly prominent when ho increase and "i? decrease. 



^/r/ho 


.1/11/.8 


.3/9/.8 


.5/4/.8 


.1/4/.4 


.3/4/.4 


.5/4/.4 


.1/3/.2 


.3/3/.2 


* = gs 

-k = SS 

•k = lasso 


1.1406 
0.8409 
0.2000 


1.2000 
0.9047 
0.6000 


0.9000 
0.9000 
0.7500 


0.9907 
0.9093 
0.4342 


1.1556 
1.1003 
0.7121 


1.2656 
1.2655 
1.0218 


0.8008 
0.8007 
0.6021 


0.9075 
0.9075 
0.8919 



Table 2: The exponents /0*(t?, r, ho) in Theorem 1.4, where ★ = gs, ss, lasso. 



Similar to that in Section 1.8, each of these methods has a phase diagram, where the 
phase space partitions into three regions: Region of Exact Recovery, Region of Almost Full 
Recovery, and Region of No Recovery. Interestingly, the separating boundary for the last 
two regions are the same for three methods, which is the line r = The boundary that 
separates the first two regions, however, vary significantly for different methods. For any 
ho G (—1, 1) and * = gs, ss, lasso, the equation for this boundary can be obtained by setting 
^^,(19, r, ho) = 1 (the calculations are elementary so we omit them). Note that the lower the 
boundary is, the better the method is, and that the boundary corresponding to the lasso 
is discontinuous at = 1/2. Compare the phase diagrams in Figure 2. 

Subset selection and the lasso are rate non-optimal for they are so-called one-step or 
non-adaptive methods |25| . which use only one tuning parameter, and which do not adapt 
to the local graphic structure. The non-optimality can be best illustrated with the diagonal 
block-wise model presented here, where each block is a 2 x 2 matrix. Correspondingly, we 
can partition the vector /3 into many size 2 blocks, each of which is of the following three 
types (i) those have no signal, (ii) those have exactly one signal, and (iii) those have two 
signals. Take the subset selection for example. To best separate (i) from (ii), we need to 
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set the tuning parameter ideally. But such a tuning parameter may not be the "best" for 
separating (i) from (iii). This explains the non-optimality of subset selection. 

Seemingly, more complicated penalization methods that use multiple tuning parameters 
may have better performance than the subset selection and the lasso. However, it remains 
open how to design such extensions to achieve the optimal rate for general cases. To save 
space, we leave the study along this line to the future. 
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Figure 2: Phase diagrams for Graphlet Screening (top left), subset selection (top right), 
and the lasso (bottom; zoom-in on the left and zoom-out on the right), where /iq = 0.5. 



1.10 Summary 

We propose Graphlet Screening as a new approach to variable selection. The key method- 
ological innovation is to use the GOSD to guide the multivariate screening. While a brute- 
force m-variate screening has a computational cost of 0{p"^), Graphlet Screening only has 
a computation cost of Lpp, by utilizing graph sparsity. 

We use asymptotic minimaxity of the Hamming distance as the criterion for assessing 
optimality. Compared with existing literature on variable selection where the oracle prop- 
erty or probability of exact support recovery is used to assess optimality, our approach is 
both mathematically more demanding and scientifically more relevant. 

We have proved that Graphlet Screening achieves the optimal rate of convergence of 
Hamming errors, especially when signals are rare and weak, provided that the Gram matrix 
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is sparse. Somewhat surprisingly, the well-known methods of subset selection and the lasso 
are not rate optimal, even with very simple Gram matrix G and even when the tuning 
parameters are ideally set. The sub-optimality of these methods is due to that they do not 
take advantage of the 'local' graphical structure as Graphlet Screening does. 

The GS methodology has three key tuning parameters: the parameter q for the threshold 
level t{D,F) = 2cr^glogp in the G5-step, and {u^'^^v^'^) = {ayj2'd\ogp, ayj2r\ogp) in the 
GC-step. While the choice of q is reasonably flexible and a sufficiently small fixed g > 
is usually adequate, the choice of u^'^ and v^^ are more directly tied to the signal sparsity 
and signal strength. Adaptive choice of these tuning parameters is a challenging direction 
of further research. One of our ideas to be developed in this direction is a subsampling 
scheme similar to the Stability Selection |28] . On the other hand, as shown in our numeric 
results in Section |3j the performance of the GS is relatively insensitive to mis-specification 
of( ep,Tp); see details therein. 



1.11 Content 

The remaining part of the paper is organized as follows. In Section [2j we introduce the 
so-called Sure Serening property and the Separable After Screening property of Graphlet 
Screening, and use these two properties to prove Theorem |1.2[ Section [3] contains numeric 
results. Section [4] discusses more connections to existing literature and possible extensions 
of Graphlet Screening, and Section [5] contains technical proofs. 

For a vector ^, ||^||g denotes the L'^-norm, and when q = 2, we drop q for simplicity. 
For a p X p matrix A, ||^||oo denotes the matrix L°°-norm, and ||^|| denotes the spectral 
norm [22j. Also, Q* is the graph of strong dependence (GOSD), Qg is its subgraph formed 
by all nodes in 5 = S{/3), and G'^ is the graph of least favorables (GOLF). 



2 Properties of Graphlet Screening, proof of Theorem 1.2 



Graphlet Screening attributes the success to two important properties: the Sure Screening 
property and the Separable After Screening (SAS) property. 

The Sure Screening property means that in the m-stage screening, by picking an 
appropriate threshold, the set U* (which is the set of retained indices after the GS-step) 
contains all but a small fraction of true signals. Asymptotically, this fraction is comparably 
smaller than the minimax Hamming errors, and so it is negligible. The SAS property means 
that except for a negligible probability, as a subgraph of the GOSD, U* decomposes into 
many disconnected components of the GOSD, where the size of each component does not 
exceed a fixed integer. Together, these two properties ensure that the original regression 
problem reduces to many small-size regression problems, and thus pave the way for the 
GC-step. 

Below, we explain these ideas in detail, and conclude the section by the proof of Theorem 



1.2 Since the only place we need the knowledge of a is in setting the tuning parameters, 
so without loss of generality, we assume a = 1 throughout this section. 

First, we discuss the GS'-step. For short, we write /3 = f3^'^{Y; 6, Q,u^^ ,v^^ , X,p,n) 
throughout this section. We first discuss the computational cost of the G5-step. As in 



Theorem 1.2, we take the threshold 6 in the definition of G* (see (1.6)) to he 6 = 6p = 
l/log(p). The proof of the following lemma is similar to that [25:, Lemma 2.2], so we omit 
it. 
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Lemma 2.1 As p ^ oo, for any Q G Adp{'y,co,g,A), with probability 1 — o(l/p^), \\Q 
S^lloo < C(log(p))-(i-'^), and Q* is K-sparse, where K < C(log(p))i/T. 



Combining Lemma 2.1 and [19], it follows that with probability 1 — o(l/p^), Q* has at most 

p{Ce{log{p))'/^r 

connected subgraphs of size < m, where we note that the second factor is at most logarith- 
mically large. Therefore, the computational cost in the G5-step is at most Lpp flops. 

We now consider the performance of the GS-step. The goal of this step is two-fold: 
on one hand, it tries to retain as many signals as possible during the screening; on the 
other hand, it tries to minimize the computational cost of the GC-step by controlling the 
maximum size of all components of U*. The key in the GS'-step is to set the collection 
of thresholds Q. The tradeoff is that, setting the thresholds too high may miss too many 
signals during the screening, and setting the threshold too low may increase the maximum 
size of the components in U*, and so increase the computational burden of the GC-step. 
The following lemma characterizes the Sure Screening property of GS, and is proved in 
Section [H 

Lemma 2.2 (Sure Screening). Fix m > 1, ^ > 0, r > 0, (t?, 7) € (0, 1)^ and k > 1 — 3 ^- In 
the m-stage screening of the GS-step, if we set the thresholds t{D,F) as in (1.25) and 



the conditions of Theorem 1.2 hold, then as p ^ 00, for any $7 G M*{'y, co,g, A) 



Y,Pii3j / 0, j ^ u;) < Lp[pi-(-+^)'^ + J^p-"^*] + 0(1). 

Next, we formally state the SAS property. Viewing it as a subgraph of Q*, U* decom- 
poses into many disconnected components X^^\ ^ k < N , where is an integer that 
may depend on the data. 

Lemma 2.3 (SAS). Fix m > 1, ^ > 0, r > 0, (t?,7) € (0, 1)^ and k > 1 - -d. In t he m- 
stage screening in the GS-step, suppose we set the thresholds t{D,F) as in (1.25) such 



that q{D,F) > qq for some constant qo = qo{'d,r) > 0. As p ^ 00, under the conditions of 
1.2. for any Q G 7W*(7, cq, 5, A), there is a constant £0 = io{i}, r, k, 7, A, cq, g) > 



Theorem 
such that 



with probability at least 1 — o{l/p), 

\I^^^\<io, l<k<N. 

We remark that a more convenient way of picking q is to let 

qo<q< i'^?u;r, \D\ is odd & ojr/^ > \D\ + i\D\^ - 1)1/2, ^^^^^^ 

90 < 9 < j'^^'j \D\ is even & urf-d > 2\D\, 

and let q be any other number otherwise, with which both lemmas continue to hold with 
this choice of q. Here, for short, oj = u){D, F;i}). Note that numerically this choice is 
comparably more conservative. 

Together, the above two lemmas say that the GS-step makes only negligible false non- 
discoveries, and decomposes U* into many disconnected components, each has a size not 
exceeding a fixed integer. As a result, the computational cost of the following GG-step is 
moderate, at least in theory. 
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We now discuss the GC-step. The key to understanding the GC-step is that the original 
regression problem reduces to many disconnected small-size regression problem. To see the 
point, define Y = X'Y and recall that G = X'X. Let Xq <\IA* be a component, we limit 
our attention to Xq by considering the following regression problem: 

ylo = qIq p + [X'zf'' , (2.37) 

where {X' ~ iV(0, G^O'^o) 7V(0, il^o-^o), and G^" is a |Xo| x P matrix according to our 
notation. What is non-obvious here is that, the regression problem still involves the whole 
vector /3, and is still high-dimensional. To see the point, letting V = {1, 2, . . . ,p} \ Z//*, we 
write 

G^o^ = G^°'^''^^° + I + II, 1= G^O'^o/3^0, II = G^'''^p^. 

Jo ■Jo<l^p, Jo T^Io 

First, by Sure Screening property, contains only a negligible number of signals, so we 
can think // as negligible. Second, for any j7o 7^ Xq and Jq <\U*, by the SAS property, Xq 
and are disconnected and so the matrix G"^"'"^" is a small size matrix whose coordinate 



are uniformly small. This heuristic is made precise in the proof of Theorem 1.2 It is now 
seen that the regression problem in ( |2.37| ) is indeed low-dimensional: 

yXo ^ Qio,io0ro ^ ^x'zf'' ^ iV(S7^0'^o/3^o, J]^0'^0), (2.38) 



The above argument is made precise in Lemma |2.4[ see details therein. Finally, approxi- 
mately, the GG-step is to minimize 



where each coordinate of ^ is either or > v^^ in magnitude. Comparing this with (2.38), 
the procedure is nothing but the penalized MLE of a low dimensional normal model, and 
the main result follows by exercising basic statistical inferences. 

We remark that in the GG-step, removing the constraints on the coordinates of ^ will 
not give the optimal rate of convergence. This is one of the reasons why the classical subset 
selection procedure is rate non-optimal. Another reason why the subset selection is non- 
optimal is that, the procedure has only one tuning parameter, but Graphlet Screening has 
the flexibility of using different tuning parameters in the GS'-step and the GG-step. See 
Section [L9] for more discussion. 



We are now ready for the proof of Theorem 1.2 
2.1 Proof of Theorem O 



By Lemma 2.2 



p 



^P(/3, / 0,i ^ U;) < L>i-(-+i)'' + ^p-'^ + o(l). (2.39) 
j=i j=i 



So to show the claim, it is sufficient to show 

p p 



Y,P{j G l^;,sgn{f3j) + sgn(/3,)) < L^iJ^p"^? + ^(i). (2.40) 

i=i i=i 



22 



Towards this end, let S{(3) be the support of /3, Q* be as in (1.6), and Q* be the GOSD. 
Let Up be the set of retained indices after the GS-step. Note that when sgn(/3j) ^ 0, there 
is a unique component Xq such that j £ Iq <\U*. For any connected subgraph Xq of Q* , let 

B{Io) = {k: Xo, Vt*{k4) / for some £ G Xq, 1 < A; < p}. 



as any node in 
(2.41) 

where the first summation is over all connected subgraphs that contains node j. By Lemma 



Note that when Xq is a component of U*^ we must have B{Xq) n U* - 
B{Xq) is connected to some nodes in the component Xq. As a result, 

p(j GXo<z^;,s(Xo)n5(/3) /0) < 12 p{kiu;,h^^), 



2.3 



with probability at least 1 — o{l/p), Q* is i^T-sparse with K = C{\og{p)Y^^ , and there 
is a finite integer Hq such that \Xq\ < Iq. As a result, there are at most finite Xq such that 
the event {j G Iq <\U*} is non-empty, and for each of such Xq, B(Io) contains at most Lp 
nodes. Using (2.41) and Lemma 2.2 a direct result is 

p p 



Y,P{j e To<iU;,B{Io) n S{(3) / 0) < LpiJ^p-"'^ + 0(1). 

i=i j=i 



(2.42) 



Comparing (2.42) with (2.40), to show the claim, it is sufficient to show that 

^2 ^(sgn(/3,) / sgn(/3,), j G Xo<U;,B{X^) n S{P) = 0) < Lplj^P^'^ + o(l). 

3=1 3=1 

(2.43) 

Fix 1 < j < p and a connected subgraph Xq such that j G Xq. For short, let S be the 
support of and S" be the support of The event {sgn(/3j) 7^ sgn(^j),j G Xq <lW*} is 
identical to the event of {sgn(/3j) 7^ sgn(/3j), j G 5* US '). M oreover, Since Xq has a finite size, 
both S and S have finite possibilities. So to show (2.43), it is sufficient to show that for 
any fixed 1 < j < p, connected subgraph Xq, and subsets Sq, Si C Xq such that j £ SqU Si, 



P(sgn(/3,) / sgn(/3,),5 = 5o,5 = Si,j G Xq S(Xo) n 5(/3) = 0) < Lpb"^. +p 



-(m+l)i9j 

' (2.44) 

We now show (2.44). The following lemma is proved in [251 A. 4]. 

Lemma 2.4 Over the event {j G Xq n {5(Xo) n 5(/3) = 0}, \\{Q/3fo-n^O'^o^^o\\^ < 
Crp(log(p))-(i-^). 

Write for short M = G^o.^o and M = O^o-^o. By definitions, is the minimizer of the 
following functional 



1 



where ^ is an \Xq\ x 1 vector whose coordinates are either or > in magnitude, u^^ = 
Y^2t?log(p), and v^^ = \Jlr log(p). In particular, 

Q(^^")>g(/3^"), 

/3^«)'M(/3^« - /3^o) + (|5i| - |So|)t?log(p). (2.45) 



or equivalently 

(/3^o - /3^oy(yX(. 



1 
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Now, write for short 5 = t~'^0^° - P^°yM0^o - First, by Schwartz inequahty, 

[0^0 - f3^oy(Y^o _ M/32:o)]2 < Sr^(Y^o _ M/3^o)'M-i(y^o - M/3^o). Second, by Lemma 
2^ Y^o = w + + rem, where w ~ N{0,M) and with probabiHty 1 - o{l/p), 

\re m\ < C (log(p))~(^~^)Tp. Last, with proba biHty at least (1 - o{l/p)), \M - M|oo < 
C ■yyiog{p)p~^'^^^^^'^^^^'^ . Inserting these into (2.45) gives that with probabiHty at least 
(1 - o{l/p)) 



1 



dr + 



(l^il-l^ol)^ 



6r 



(21og(p)) + 0((log(p)r 



Since 7 < 1, 0{{log{p)y) is negligible. We note that w'M ~ X\Xo\(^)- Inserting this 
back to (|2.44l) , the left hand side 



< f "'^(xfxo|(0) > [{VS^+{\Si\ - \So\)^/V5^)+]\\og{p)/2)) + o{l/p). 

Assume sgn(/3j) 7^ sgn(/3j), and fix all parameters except 6, Sq and Si. By arguments 
similar to the proof of Lemma |1.4[ the above quantity cannot achieve its maximum in the 
cases where Sq = Si. Hence we only need to consider the cases where Sq ^ Si. We also 
only need to consider the cases where maxdS'ol, |'S'i|) < m, since the sum of the probabilities 
of other cases is well controlled by pi-C'^+i)''. The claim follows by the definitions of p*. 

□ 



3 Simulations 

We conducted a small-scale simulation study to investigate the numerical performance of 
Graphlet Screening and compare it with the lasso. The subset selection is not included 
for comparison since it is computationally NP hard. We consider experiments for both 
random design and fixed design, where as before, the parameters {ep,Tp) are tied to {'&,r) 
by = P~^ and Tp = y^2r log(p) (we assume a = 1 for simplicity in this section). The 
experiments with random design contain the following steps. 

1. Fix {p,'&,r, such that fi G Qp{Tp). Generate a vector b = (61, 62, • • • j such 
that bi *~ Bernoulli(ep), and set f3 = bo p,. 

2. Fix K and let n = Up = p". Generate an n xp matrix with iid rows from A^(0, (l/n)ri). 

3. Generate Y ~ N{Xf3,Ip), and apply Graphlet Screening and the lasso. 

4. Repeat 1-3 independently, and record the average Hamming distances. 

The steps for fixed design experiments are similar, except for that Up = p and X = Jl^/^. 

Graphlet Screening uses tuning parameters {m, Q,u^^ ,v^^). We set m = 3 for our 
experiments, which is usually large enough due to signal sparsity. The choice of Q is not 



critical, as long as the corresponding parameter q satisfies (1.26). Numerical studies below 
(e.g. Experiment 4a) support this point. In principle, the optimal choices of {Q,u^'^ ,v^^) 
depend on the unknown parameters (ep, Tp), and how to estimate them in general settings 
is a lasting open problem (even for linear models with orthogonal designs). Fortunately, 
our studies (e.g. Experiment 4b-4d) show that mis-specifying parameters (e^, Tp) by a 
reasonable amount does not significantly affect the performance of the procedure. For this 
reason, in most experiments below, we set the tuning parameters in a way by assuming 
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(ep, Tp) as known. To be fair in comparison, we also set the tuning parameters of the lasso 
ideally assuming {ep,Tp) as known. We use glmnet package [TH] to perform lasso. 

The simulations contain 4 different experiments which we now describe separately. 

Experiment 1. In this experiment, we investigate how different choices of signal vector 
/? affect the comparisons of two methods. We use a random design model, and J7 is a 
symmetric tri-diagonal correlation matrix where the vector on each sub-diagonal consists 
of blocks of (.4, .4, -.4)'. Fix (p, k) = (0.5 x 10^^,0.975) (note n = p'' ^ 4,000). We let 
ep = p~^ with -d G {0.35,0.5} and let Tp € {6,8,10}. For each combination of (ep,rp), we 
consider two choices of fi. For the first choice, we let fi be the vector where all coordinates 
equal to Tp (note /3 is still sparse). For the second one, we let fj, be the vector where the 

signs of Hi = ±1 with equal probabilities, and \fii\ *~ O.Sz^rp + 0.2/i, where u-rp is the point 
mass at Tp and h{x) is the density of Tp{l + V/6) with V ~ Xi- For Graphlet Screening, the 
tuning parameters (m, u^^,v^'') are set as (3, y^2 log(l/ep), Tp), and the tuning parameter q 
in Q are set as maximal possible value satisfying (1.26). The average Hamming errors for 



both procedures across 40 repetitions are tabulated in Table [3} 



Tp 


6 


8 


10 


Signal Strength 


Equal 


Unequal 


Equal 


Unequal 


Equal 


Unequal 


i9 = 0.35 


Graphic Screening 
lasso 


0.0810 
0.2424 


0.0825 
0.2535 


0.0018 
0.1445 


0.0034 
0.1556 




0.0941 


0.0003 
0.1109 


= 0.5 


Graphic Screening 
lasso 


0.0315 
0.1107 


0.0297 
0.1130 


0.0007 
0.0320 


0.0007 
0.0254 




0.0064 




0.0115 



Table 3: Ratios between the average Hamming errors and pCp (Experiment 1), where 
"Equal" and "Unequal" stand for the first and the second choices of /U, respectively. 

Experiment 2. In this experiment, we generate /3 the same way as in the second 
choice of Experiment 1, and investigate how different choices of design matrices affect 
the performance of the two methods. Setting (p, k) = (0.5 x 10^,0.35,0.975) and Tp G 
{6, 7, 8, 9, 10, 11, 12}, we use Gaussian random design model for the study. For each method, 
the tuning parameters are set in the same way as in Experiment 1. The experiment contains 
3 sub-experiments 2a-2c. 

In Experiment 2a, we set as the symmetric diagonal block- wise matrix, where each 
block is a 2 X 2 matrix, with 1 on the diagonals, and ±0.5 on the off-diagonals (the signs al- 
ternate across different blocks) . The average Hamming errors of 40 repetitions are reported 
in Figure |3| 

In Experiment 2b, we set as a symmetric penta-diagonal correlation matrix, where 
the main diagonal are ones, the first sub-diagonal consists of blocks of (.4, .4, —.4)', and the 
second sub-diagonal consists of blocks of (.05, —.05)'. The average Hamming errors across 
40 repetitions are reported in Figure [3j 

In Experiment 2c, we generate as follows. First, we generate il. using the function 
sprandsym(p,K/p) in matlab. We then set the diagonals of to be zero, and remove some 
of entries so that Vt is ET-sparse for a pre-specified K. We then normalize each non-zero 
entry by the sum of the absolute values in that row or that column, whichever is larger, 
and multiply each entry by a pre-specified positive constant A. Last, we set the diagonal 
elements to be 1. We choose = 3 and A = 0.7, draw 5 different Vt with this method, and 
for each of them we repeat the simulation 10 times independently. The average Hamming 
errors are reported in Figure [3j 
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The results suggest that Graphlet Screening is consistently better than the lasso. 




6 8 10 12 6 8 10 12 6 8 10 12 



Figure 3: x-axis: Tp. y-axis: ratios between the average Hamming errors and pep (from left 
to right: Experiment 2a, 2b, and 2c). 



Experiment 3. In this experiment, we investigate what are the minimum signal strength 
levels Tp required by Graphlet Screening and the lasso to yield exact recovery, respectively. 
Fixing p = 10^ we let ep = p'^ for ^ = 0.25, 0.45, 0.65, and let Tp G {5, 6, 7, 8, 9, 10, 11, 12}. 
We use a fixed design model where is the block-wise matrix as in Experiment 2a. For 
each pair of {ep,Tp), we generate /3 as in the second choice of Experiment 1. The tuning 
parameters for Graphlet Screening and the lasso are set in the same way as in Experiment 
1. The average Hamming errors across 20 repetitions are tabulated in Table|4| 

Suppose we say a method yields 'exact recovery' if the average Hamming error < 3. 
Then the minimum Tp for Graphlet Screening to yield exact recovery is ~ 9, but that 
for the lasso is much larger (> 12). For larger i}, the differences are less prominent, but 
the minimum Tp for Graphlet Screening to yield exact recovery is consistently smaller than 
that of the lasso. 





Tp 


5 


6 


7 


8 


9 


10 


11 


12 




0.25 


Graphic Screening 


58 


21.4 


9.2 


3.5 


3 


1.8 


1.0 


0.6 


lasso 


75.2 


34.4 


21.6 


15 


14.3 


12.6 


10.1 


8.9 




0.45 


Graphic Screening 


11 


3.7 


0.7 


0.2 


0.1 











lasso 


13.9 


5.2 


1.3 


0.6 


0.1 


0.4 


0.2 


0.2 




0.65 


Graphic Screening 


3.4 


0.8 


0.1 


0.1 














lasso 


3.7 


1 


0.3 


0.1 















Table 4: Comparison of average Hamming errors (Experiment 3). 



Experiment 4- In this experiment, we investigate how sensitive Graphlet Screening is 
with respect to the tuning parameters. The experiment contains 4 sub-experiments, 4a-4d. 
In Experiment 4o, we investigate how sensitive the procedure is with respect to the tuning 
parameter q in Q (recall that the main results hold as long as q fall into the range given in 
( 1.26 )) , where we assume (cp, Tp) are known. In Experiment 46-4(i, we mis-specify (ep, Tp) by 
a reasonably small amount, and investigate how the mis-specification affect the performance 
of the procedure. For the whole experiment, we choose /3 the same as in the second choice 
of Experiment 1, and $7 the same as in Experiment 26. We use a fixed design model in 
Experiment 4a-4c, and a random design model in Experiment 4d. For each sub-experiment. 
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the results are based on 40 independent repetitions. We now describe the sub-experiments 
with details. 

In Experiment 4a, we choose G {0.35,0.6} and r G |1. 5, 3). In Graphlet Screening, 

let 

Qmax — Qmax{D, F) be the maximum value of q satisfying ( |l.26[ ). For each combination 
of (??,r) and {D,F), we choose q{D,F) = qmax{D,F) x {0.7,0.8,0.9,1,1.1,1.2} for our 
experiment. The results are tabulated in Table [5| which suggest that different choices of q 
have little influence over the variable selection errors. We must note that the larger we set 
q{D,F), the faster the algorithm. 



q{F,D)/qmax{F,D) 


0.7 


0.8 


0.9 


1 


1.1 


1.2 


{'d,r) = (0.35,1.5) 


0.0782 


0.0707 


0.0661 


0.0675 


0.0684 


0.0702 


(i?,r) = (0.35,3) 


0.0066 


0.0049 


0.0036 


0.0034 


0.0033 


0.0032 


(??,r) = (0.6,1.5) 


0.1417 


0.1417 


0.1417 


0.1417 


0.1417 


0.1417 


(i9,r) = (0.6,3) 


0.0089 


0.0089 


0.0089 


0.0089 


0.0089 


0.0089 



Table 5: Ratios between the average Hamming errors of Graphlet Screening and pep (Ex- 
periment 4a). 

In Experiment 46, we use the same settings as in Experiment 4a, but we assume 
(and so e^) is unknown (the parameter r is assumed as known, however), and let "d* is 
the misspecified value of -d. We take G i? x {0.85,0.925,1,1.075,1.15,1.225} for the 
experiment. 

In Experiment 4c, we use the same settings as in Experiment 4a, but we assume r 
(and so Tp) is unknown (the parameter "i? is assumed as known, however), and let r* is the 
misspecified value of r. We take r* = r x {0.8,0.9, 1, 1.1, 1.2, 1.3} for the experiment. 

In Experiment 46-4c, we run Graphlet Screening with tuning parameters set as in Exper- 
iment 1 , except "i? or r are replaced by the misspecified counterparts tD* and r* , respectively. 
The results are reported in Table [6| which suggest that the misspecifications have little 
effect as long as r* /r and t?*/"!? are reasonably close to 1. 





0.85 


0.925 


1 


1.075 


1.15 


1.225 


i'd,r) = (0.35,1.5) 


0.0799 


0.0753 


0.0711 


0.0710 


0.0715 


0.0746 


{■d,r) = (0.35,3) 


0.0026 


0.0023 


0.0029 


0.0030 


0.0031 


0.0028 


i'd,r) = (0.6,1.5) 


0.1468 


0.1313 


0.1272 


0.1280 


0.1247 


0.1296 


(t?,r) = (0.6,3) 


0.0122 


0.0122 


0.0139 


0.0139 


0.0130 


0.0147 




0.8 


0.9 


1 


1.1 


1.2 


1.3 


{'d,r) = (0.35,1.5) 


0.0843 


0.0731 


0.0683 


0.0645 


0.0656 


0.0687 


{■d,r) = (0.35,3) 


0.0062 


0.0039 


0.0029 


0.0030 


0.0041 


0.0054 


(i?,r) = (0.6,1.5) 


0.1542 


0.1365 


0.1277 


0.1237 


0.1229 


0.1261 


(i9,r) = (0.6,3) 


0.0102 


0.0076 


0.0085 


0.0059 


0.0051 


0.0076 



Table 6: Ratios between of the average Hamming error of the Graphlet Screening and pep 
(Experiment 46 (top) and Experiment 4c (bottom)). 

In Experiment 4d, we re-examine the misspecification issue with a random design. We 
use the same settings as in Experiment 46 and Experiment 4c, except for (a) while we 
use the same Q as in Experiment 46, the design matrix X are generated according to the 
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random design model as in Experiment 2b, and (b) we only investigate for the case of r = 2 
and -i? G {0.35,0.6}. The results are summarized in Table [7| which is consistent with the 
results in 46-4c. 





0.85 


0.925 


1 


1.075 


1.15 


1.225 


(i?,r) = (0.35,2) 


0.1730 


0.1367 


0.1145 


0.1118 


0.0880 


0.0983 


(t?,r) = (0.6,2) 


0.0583 


0.0591 


0.0477 


0.0487 


0.0446 


0.0431 


1 7* 


0.8 


0.9 


1 


1.1 


1.2 


1.3 


(^?,r) = (0.35,2) 


0.1881 


0.1192 


0.1275 


0.1211 


0.1474 


0.1920 


(i?,r) = (0.6,2) 


0.0813 


0.0515 


0.0536 


0.0397 


0.0442 


0.0510 



Table 7: Ratios between the average Hamming errors of Graphlet Screening and pe^ (Ex- 
periment 4(i). 



4 Connection to existing literature and possible extensions 

Our idea of utilizing graph sparsity is related to the graphical lasso [271 ttZ] ■, which also 
attempts to exploit graph structure. However, the setting we consider here is different from 
that in |27|ll7j. and our emphasis on precise optimality and calibration is also very different. 
Our method allows nearly optimal detection of very rare and weak effects, because they 
are based on careful analysis that has revealed a number of subtle high-dimensional effects 
(e.g. phase transitions) that we properly exploit. Existing methodologies are not able to 
exploit or capture these phenomena, and can be shown to fail at the levels of rare weak 
effects where we are successful. 

The paper is closely related to the recent work by Ji and Jin p5] (see also [15l [20]). 
and two papers use a similar rare and weak signal framework and a similar random design 
model. However, they are different in important ways, since the technical devise developed 
in [25] can not be extended to the current study. For example, the lower bound derived in 
this paper is different and sharper than that in [25]. Also, the procedure in [25] relies on 
marginal regression for screening. The limitation of marginal regression is that it neglects 
the graph structure of GOSD for the regularized Gram matrix (1.5), so that it is incapable 
of picking variables that have weak marginal correlation but significant joint correlation to 
Y. Correct selection of such hidden significant variables, termed as the challenge of signal 
cancellation [3l], is the difficulty at the heart of the variable selection problem. One of 
the main innovation of Graphlet Screening is that it uses the graph structure to guide the 
screening, so that it is able to successfully overcome the challenge of signal cancelation. 

Additionally, two papers have very different objectives, and consequently the underlying 
analyses are very different. The main results of each of these two papers can not be deduced 
from the other. For example, to assess optimality [25] uses the criterion of the partition of 
the phase diagram, while the current paper uses the minimax Hamming distance. Given 
the complexity of the high dimensional variable selection, one type of optimality does not 
imply the other, and vice versa. Also, the main result in |25| focuses on conditions under 
which the optimal rate of convergence is Lpp^^^'^^^'f /^^^^ for the whole phase space. While 
this overlaps with our Corollaries 1.2 and 1.3, we must note that [23] deals with the much 



more difficult cases where r/i? can get arbitrary large; and to ensure the success in that 
case, they assume very strong conditions on the design matrix and the range of the signal 
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strength. On the other hand, the main focus of the current paper is on optimal variable 
selection under conditions (of the Gram matrix G as well as the signal vector j3) that are 
as general as possible. 

While the study in this paper has been focused on the Random Design model RD('(?, k, Q), 
extensions to deterministic design models are straightforward (in fact, in Corollary we 
have already stated some results on deterministic design models), and the omission of dis- 
cussion on the latter is largely for technical simplicity and the sake of space. In fact, for 
models with deterministic designs, since the likelihood ratio test in the derivation of the 
lower bound matches the penalized MLE in the cleaning step of the GS, the optimality of 
the GS follows from the Sure Screening and Separable After Screening properties of the GS. 
The proof of these properties, and therefore the optimality of the GS, follows the same line 
as those for random design as long as maxj | /3iG{i, {i, j) = 0}|/rp is small. This 

last condition on G holds whenp^~''||G'— ri||oo = o(l) with a certain U G A^*(7, cq, g, A). Al- 
ternatively, this condition holds when p^~^\\G — Q\\'^logp = o(l) with Q G ^A*{'y,co, g, A), 
provided that sgn(/3j) are iid symmetric random variables as in [5]. 

Another interesting direction of future research is the extension of the GS methodology 
to more general models such as logistic regression. The extension of the lower bound in 
Theorem 1.1 is relatively simple since the degree of GOLF can be bounded using the true 
p. This indicates the optimality of the GS method in logistic and other generalized linear 
models as long as proper generalized likelihood ratio or Bayes tests are used in both the 
GS- and GC-steps. 



5 Proofs 

In this section, we provide all technical proofs. For simplicity, we assume o" = 1 in this 
section. 

5.1 Proof of Lemma Jl.ll 

When contains a connected subgraph of size > m + 1, it must contain a connected 
subgraph with size m + 1. By there are < p{eK)"^^^ connected subgraph of size 
m + 1. Therefore, the probability that Qg has a connected subgraph of size (m + 1) 
< p{eK)"^^^€'^^^ . Combining these gives the claim. □ 

5.2 Proof of Theorem O 

Write for short p* = />*("!?, r, a, fi). Without loss of generality, assume p* < P2 — • • • — Pp- 
We construct indices ii < 12 < ■ ■ ■ < im follows, (a) start with B = {1,2, . . . ,p} and let 
ii = 1, (b) updating B by removing ii and all nodes j that are connected to ii in GOLF, 
let i2 be the smallest indices, (c) defining 13,14, ... ,im hy repeating (b), and terminates the 
process when no indices is left in B. Since each time we remove at most dp{G^) nodes, it 
follows that 

p m 

j=i k=i 

For each 1 < j < p, as before, let {Vqj, V-^j) be the least favorable configuration, and let 
= argmin|g(o)gB^,, sgn(e{o))-^sgn{e(i))}«(^^°^^^^^;^)- % our notations. 
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it is seen that 



(5.47) 



We construct a p x 1 vector fj.* as follows. Fix j £ {ii, - ■ ■ ,im}- For all indices in Vg* , set 

the constraint of fi* on these indices to be . For any index i ^ U^^^Vq*^ , set fi* = Tp. 
Since 

p 

Ramm*{i!),K,r,a,n) > inf Hp{(3;ep, Hp, fi* ,n) = inf V P(sgn(/3j) ^ sgn(/3j)), (5.48) 
it follows that 



Hamm*(i?, k, r,a,n) >^ ^ P(sgn(/3j) / sgn(/3j)), 

fc=iieyoifeUyHj^ 



(5.49) 



where /3 = b o fi* as in (5.48)-(5.49). Combining (5.46) and (5.49), to show the claim, we 

(5.50) 



only need to show that for any 1 < k < m and any procedure /3, 



^ P(sgn(/3,) / sgn(/3,)) > Lpp . 



Towards this end, we write for short Vq = Voi^, Vi = Vu^^, V = VoUVi, 6'(°) = 6^;"^, and 



6»(i) = e[]^ . Note that by Lemma 



1.3 



\V\ < {d + rf/{2^r). 



(5.51) 



Consider a test setting where under the null Hq^ P = (3^^^^ = ho^* and lyofi^'^^ = /yo^C^), and 
under the alternative ffi, /3 = /S^^-* which is constructed by keeping all coordinates of 
unchanged, except those coordinates in V are perturbed in a way so that/v/o/3{i) = Iyo6l(i). 
In this construction, both /J^*^) and /J^^^ are assumed as known, but we don't know which 
of Hq and Hi is true. In the literature, it is known that inf^ ^^^y P(sgn(/3j) / sgn(/3j)) 
is not smaller than the minimum sum of Type I and Type II errors associated with this 
testing problem. 

Note that by our construction and (5.47), the right hand side is a* {Vq^Vi]^). At 
the same time, it is seen the optimal test statistic is Z = (6'(^) - 6'(°))'X'(y - X/3(°)). 
It is seen that up to some negligible terms, Z ~ A^(0, Q!*(Vo, Vi; ri)rp) under Hq, and 
Z ~ N{a*{VQ,Vi\^)Tp,a*{Vo,Vi;^)Tp) under Hi. The optimal test is to reject Hq when 
Z > t[a* {Vo,Vi;0,)]^^'^Tp for some threshold t, and the minimum sum of Type I and Type 
II error is 

mf{el''<>\m+el^^\Ht-[a*{Vo,Vi-,n)]'/\)}, (5.52) 



Here, we have used P{Ho) ~ ej,^°' and P{Hi) ~ ep , as a result of the Binomial structure 
in /3. It follows that 

^P(sgn(/3,) ^ sgn(/3,)) > mi{e\^<^\m + e^^Mt - [a*iVo,Vi-,n)]'/\p)} . 
iev 



l^il 



Using Mills' ratio and definitions, the right hand side > Lpp viVoyi;^)^ a,nd (5.50) follows 
by recalling (5.47). □ 
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5.3 Proof of Lemma 11.31 



Let T/q = and Vi = {j}. It is seen that q*(Vo, Vi] ^) = 1, and 7/(^0, Vi] ^) < {'& + rf/{Ar). 
Using this and the definitions of Vq* and V^* , max{|Voj|, iVj*,!}-!? <("!? + r)2/(4r) and the 
claim follows. □ 



5.4 Proof of Lemma 11.41 



Let sets Vq and Vi and vectors O^^^f and 6^'^'> be as in Section 1.6, and \ei V = VqVJVi. By 
the definition of p* r, a, 0), 

p*j{'&, r, o, S7) = min(/, II), 

where 

1= min 77(^0,^1; 5^), 11= min 7/(^0, 14; 0). 

{(vb,Vi):ieyiuFo,v'o7^yi} {Vo:j&VouVi,Vo=Vi} 

So to show the claim, it is sufficient to show 

/= min p(D,F:n), II > I. (5.53) 

{(D,F):ieDUF,DnF=0,D7^0,|DUF|<g} 



Consider the first claim in (5.53). Write for short F = F{Vq, Vi) = Vq DVi and D = 
D(yo, Vi) = V\F. By the definitions, 15/0. The key is to show that when |Vb U Vi| <g, 

a*{Vo,Vi;n) = uj{D,F;n). (5.54) 
Towards this end, note that by definitions, a*(Vb, Vi; Q) = a{9i^\ oi^^), where 



,{0) 



argmin|g(o)g5^^^e(i)gB^ja(^(°),0(^)). 



By a > a*{Q,) and the way a*{Q) is defined, {oi^\9i^^) remains as the solution of the 
optimization problem if we relax the conditions 0*-*^ G -By- to that of 6^^^ = /y. o where 
/z^*-* G Qp{Tp) (so that upper bounds on the signal strengths are removed), z = 0, 1. As a 
result, 

a*{Vo,Vi;n) = min a{e^°\e^^'>). (5.55) 

{e(»)e/^^o^{»),/i(')eep{Tp),i=o,i,} 



We now study \5.55\ . For short, write ^ = ^^(^(i) - O^^^^)^ , Qyv = ^^'^ , = Tp'^iO'^^^ - 
9^^^)^, and similarly for i^DD, ^DF, ^FD, ^FF, and S^p. Without loss of generality, assume 
the indices in D come first in V. It follows 

Q _ / r^DD ^DF 
\i^FD ^FF 

and 

a(0(O), 0(1)) = ^'QvV^ = io^DDiD + 2i'D^DFiF + ip^FF^F- (5.56) 

By definitions, it is see n that there is no constraint on the coordinates of so to optimize 

the quadratic form in (5.54), we need to choose ^ is a way such that = —Q.'^\,Vlpp)iD-, 
and that minimizes 

i'oi^DD — ^df^pp^fd)^d, 
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where every coordinate of > 1 in magnitude. Combining these with (5.55) gives (5.54). 
At the same time, we rewrite 



mm 

{(D,F):je-DUF,D^0,_DnF=0} 



( 



mm r]{Vo,Vi;n)y 



(5.57) 



By similar arguments as in the proof of Lemma 1.3, the subsets (Vq, Vi) that achieve the 
minimum of r]{Vo,Vi;Q) must satisfy |Vo U Vi\ < g. Using (5.54), for any fixed D and F 
such that \DL) F\ < g, D ^ (}> and DCiF = 0, the term in the big bracket on the right hand 
side is 



min I 

{{VQ,Vi):Vo'JVi=DuF,VonVi=F} 



{2\F\ + \D\)i} llFil -l^oll^? 1 



- + 



+ -[{Vp{D,F;n), 



ll^il-l^oll^ 
\/piD,F;n)r 



It is worth noting that for fixed D and F, the above quantity is monotone increasing with 
1 1^1 1 — l^oll- When l^l is even, the minimum is achieved at {Vo,Vi) with |Vo| = l^ili and 



when \D\ is odd, the minimum is achieved at (Vo,Vi) with ||Vi| — |Vo|| - 
cases, the minimum is p{D, F;^). Inserting this to ( 5.57| ), it is seen that 



min p(D,F:^}), 

{{D,F):j(^DUF,DnF=d,Dft<$,\DUF\<g} 



1, and in both 
(5.58) 



which is the first claim in (5.53). 

Consider the second claim of (5.53). In this case, by definitions, Vq = Vi but sgn{6^^^) ^ 
sgn(^(^)). Redefine D as the subset of Vq where the signs of the coordinates of 6^^^ do not 
equal to those of 9^^\ and let F = V \ D. By definitions, it is seen that a*{Vo, Vq; 0) = 
4a* (F, Vo;i^), where we note D ^ $ and F ^Vq. By the definition of r/(Vo, Vi;Cl), it follows 
that ri{Vo, Vq; 0,) > ri{F, Vq; 0,), and the claim follows. □ 



5.5 Proof of Corollaries [g and [O 



Write for short u = p(D,F;^}) and T = rj'd. The following inequality is frequently used 
below, the proof of which is elementary so we omit it: 



where k 



1^1 + li^l- 



(5.59) 



To show these corollaries, it is sufficient to show for all subsets D and -F of {1, 2, . . . 



p{D,F;n) > (^9 + r)V(4r), 



II^I > 1. 



where p{D,F;Q) is as in ( 1.23| ). By basic algebra, (5.60) is equivalent to 



(wT + 1/{ujT) - 2)1{ujT > 1} > (T + 1/r - 2(|D| + 2|F|)) 
uj> ^[iT + l/T)/2 + l-{\D\+2\F\)], 



\D\ is odd, 



(5.60) 



(5.61) 



Note that when {\D\, \F\ 
the case where 



|-D| is even. 

(1,0), this claim holds trivially, so it is sufficient to consider 



\D\ + IFI > 2. 



(5.62) 



We now show that (5.61) holds under the conditions of each of corollaries. 
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5.5.1 Proof of Corollary 

In this corollary, 1 < (T + l/r)/2 < 3, and if either (a) \D\ + 2\F\ > 3 and \D\ is odd 



or (b) \D\ + 2\F\ > 4 and \D\ is even, the right hand side of (5.61) < 0, so the claim 
holds trivially. Therefore, all we need to show is the case where (IL'IjIFI) = (2,0). In 
this case, since each off-diagonal coordinate < — 5 = pp, it follows from definitions 
and basic algebra that uj > 2(1 — po) = 4(3 — 2^/2), and (5.61) follows by noting that 

|[(r + i/r)/2 + 1 - {\D\ + 2\F\)] = (1 - i/r)2 < 4(3-2^/2). □ 



5.5.2 Proof of Corollary [TS] 

In this corollary, 1 < (T + l/T)/2 < 5. First, we consider the case where \D\ is odd. By 
similar argument, ( 5.61[ ) holds trivially when \D\ + 2\F\ > 5, so all we need to consider is 
the case {\D\,\F\) = (1,1) and the case {\D\,\F\) = (3,0). In both cases, \D\ + 2\F\ = 3. 
By (5.59), when ujT < 1, there must be T < 1/ min(A2,3A3). By the conditions of this 
corollary it fohows T < (5 + 2^/6)/4 < 3 + 2^/2. When 1 < T < 3 + 2^/2, there is 
T + l/T- 



6 < 0, and thus (5.61) holds for uT < 1. When ojT > 1, (5.61) holds if and only 
2 > T + 1 /T — 6. By basic algebra, this holds if 



->^[(1 



i/r) + V(i-i/r)2-4/r]- 



(5.63) 



Note that the right hand of ( 5.63 ) is a monotone in T and has a maximum of (3 + 2\/ 2)(5 
2\/6) at r = (5 + 2^6). Now, on one other hand, when {\D\,\F\) = (1,0), by (fstsDl) 



and conditions of the corollary, co > 3A3 > (3 + 2^/2) [5 — 2^/6). On the other hand, 
when (|Z)|,|F|) = (1,1), by basic algebra and that each off-diagonal coordinate of 17 < 

w > 1 - /o? = (3 + 2\/2)(5 - 2V6). 



1 + {V6- \/2)/(l + a/372) = pi in magnitude. 
Combining these gives ( |5.61 ). 



We now consider the case where \D\ is even. By similar argument, (|5.61|) holds when 
I +2|. 

mAF\ 



D\ -|- 2|F| > 6, so all we need is to show is that (5.61) holds for the following three cases: 
(4,0), (2,1), (2,0). Equivalents, this is to show that w > ^[{T+l/T)/2 - 3] 
in the first two cases and that a; > ^[{T + l/T)/2 - 1] in the last case. Similarly, by 
the monotonicity of the right hand side of these inequalities, all we need to show is ci; > 
4(5 — 2^/Q) in the first two cases, and uj > 8(5 — 2^/6) in the last case. Now, on one 



hand, using (5.59), u > 4A4 in the first case, and uj > 2A3 in the second case, so by the 
conditions of the corollary, uj > 4(5 — 2\/6) in the first two cases. On the other hand, in 
the last case, since all off-diagonal coordinates of < 8\ /6 — 19 = po in magnitude, and 
UJ > 2(1 — Po) = 8(5 — 2\/6). Combining these gives (5.61). □ 



5.5.3 Proof of Corollary 1.4 



Let be the unique integer such that 2N — 1 < (T + l/T)/2 < 2N + 1. First, we consider 
the case where \D\ is odd. Note that when \D\ + 2\F\ > 2N + 1, the right hand side of 
(5.61) < 0, so all we need to consider is the case \D\ + 2\F\ < 2N — 1. Write for short 
k = k{D, F) = \D\ + \F\ and j = j{D, F) 
that \D\ + 2\F\ 



\D\ + \F\ and j = j{D,F) = {\D\+2\F\ + l)/2. By (|02|), definitions, and 
< 2iV - 1, it is seen that 2 < k < 2N 
By the condition of the corollary. 



1 and {k + l)/2<j < min{k,N}. 



XI > 



{T + l/r)/2 -2j + 2 + ^[{T + l/r)/2 - 2j + 2]^ - 1 
T{2k - 2j + 1) 
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Note that \D\ = 2k — 2j + 1. Combining these with ( 5.59| ) gives 

uT > {2k - 2j + 1)XIT > (T + l/T)/2 - 2 j + 2 + y/[(T + l/T)/2 - 2j + 2]^ - 1 > 1. 



and (5.61) follows by basic algebra. 

We now consider the case where \D\ is even. Similarly, the right hand side of (5.61) 
is negative when \D\ + 2\F\ > 2{N + 1), so we only need to consider the case where 
\D\+2\F\ < 2N. Similarly, write for short k = k{D,F) = \D\ + \F\ and j = (IZ^I + 2|F|)/2. 
It is seen that 2 < k < 2N and k/2 < j < min{fc — 1, A^}. By the conditions of the corollary. 



At > 



(T + l/T)/2 + 1 - 2j 
T{k-j) 



Note that |L>| = k-j. It follows from (5.59) that oj > 2{k-j)Xl > |[(r + l/r)/2 + l-2j], 
and (5.61 ) follows. 



□ 



5.6 Proof of Lemma 12.21 

To show the claim, it is sufficient to show that for any fixed 1 < j < p, 

p{j i u;,p, / 0) < Lpip-".* + o{i/p)]. 



(5.64) 



Using Lemma 2.1 and [25^ Lemma 3.1], there is an event Ap that depends on (X, /?) such 
that P{Ap) < o{\/p) and that over the event, il* is ET-sparse with K = C{\og{p)Y/'^ , 

\\^* - felloe < (log(p))-(i-^), \\{X'X - < C||l^||y2Mp)p-[(--(i-'')l/2, 

and for all subset B with size < m, 

\\G^^^ -^B^^\\^<Lpp--l\ 

Recall that Q* is the GOSD and Q*g is the subgraph of the GOSD formed by the nodes in 
the support of /3, S{j3) = {1 < j < p : /3j 7^ 0}. When [5j 7^ 0, there is a unique component 
Xq such that j G Xq <I^J {A<\B means that A is component or maximal connected subgraph 
of B). Let Bp be the event |Xo| < m. By Frieze [19], it is seen that 

P{Bl n Ap) < Lpp-('"+i)'^. 



So to show (5.64), it is sufficient to show that 

P{j i U;,j E Xo < Gl, Ap n Bp) < Lpp-P* 



(5.65) 



Now, in the screening procedure, when we screen Xq, we have Xq = D U F as in (1.7). 
Since the event {j ^ U*,i € Xq <] Qg] is contained in the event {T{Y, D, F) < t{D, F)}, 



p{j i u; ,j Gio<g*s,Apn Bp) < p{T{Y, b, P) < t{b, F),j Gio<g:s,Apn b, 

where the right hand side does not exceed 

P{T{Y, D,F)< t{D, F),j eIo<g*s,Apn Bp); 

{Xo,D,F):j G Xq & Xq = D U F is a partition 

(5.66) 
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note that (Xq, D,F) do not depend on z (but may still depend on (X, P)). First, note that 
over the event Ap, there are at most {eK)"^^^ Iq such that j G Xq and < m. Second, 
note that for each Iq, there are only finite ways to partition it to D and F. Last, note 
that for any fixed j and Xq, P{j G Xq <l Gg) < ej, ' . Combining these observations, to show 



(5.65), it is sufficient to show that for any such triplet (Xq, D, F), 

ef«lp(T(y, D, F) < t{D, F)\{j e Xq < g*s} nApH Bp) < L 



pP ^J . 



(5.67) 



We now show ( |5.67 ). Since Am > C > 0, it follows from the definition of Ap and basic 
algebra that for any realization of {X, (3) in Ap n Bp, 



'In A — 1 1 



||(G^«'^o 



< C. 



(5.68) 



Recall that Y = X'Y and denote for short y = (C^TcXo^-iy^o, it ig seen that 

y = /3^o + ^ + ^g^^ Af(0, (G^«'^o)~^), rem= (G^«'^o)"^G^«'^o/3^o. (5.69) 

Since Xq is a component of Qg, (Q*)-^'-^'-^o p^o = Q. Therefore, we can write rem = {G'^°'^°)^^{I+ 
II), where / = {G^o -n^o,TS)p^ and II = [n^o,^ _(^^*fo,i:Sji3^ _ By the definition of ^p, 
||/||oo < Cy2fo^p-I--(i-'')l/2, and ||/I||oo < \\n-n*\U\l3^m^ < CTp{log{p))-'-^-^\ 
Combining these with (5.68) gives ||rem||oo < Crp(log(p))~^"'^~'''^. 

At the same time, let yi, wi, and rem^ be the restriction of y, w, and rem to indices 
in D, correspondingly, and let H = [G^'^ - G^^^ {G^^^)-^G^^^\. By ( |5.69[ ) and direct 
calculations, 

T{Y, D, F) = y[Hyi, y, ~ iV(/3^ + rem\H~'), 
and so T{Y, D, F) is distributed as x'\£)\{^)i where the non-central parameter is 

(/3^ + rem^)'H{P^ + rem^) = 6 + 0((log(p))^), 6 = {P^)'HP^; 

where since ^ C, 6 > CTp and is the dominating terms. It follows that 

P{T{Y,D,F) < t{D,F)\{j G Xo < g%] r^ Apr^ Bp) < P{x\D\i^) < t{D,F)). (5.70) 

Now, first, by definitions, 6 > 2uj{D, F;il.)rlog{p), so by basic knowledge on non-central 

P{XiD\ W < tiD, F)) < Pixfm (MD, F; ^)r \og{p)) < t{D, F)). 



Second, recalling t(D,F) = 2qlog{p), we have 

P{xfn\{MD,F;n)rlog{p)) < t{D,F)) < Lpp- 



[{^u{D,F;n)T-^)^ 



(5.71) 
(5.72) 



Inserting (5.71 )-(5.72) into (5.70) and recalling tp = p , 



ef«ip(r(y, D, F) < t{D, F)\{j E Xo < g*s] n n Bp) < Lpp-(i^«i''+[(V'^(^'^'^)'^-v^)+i'). 

(5.73) 

By the choice of q and direct calculations. 



\IoW + [iV^iD,F;n)r - ^)+f > p{D, F; ^) > 



Pi 



(5.74) 



where p{D,F;Q) and p*^ are as in and (1.23) and (1.21). Combining (5.73)-(5.74) gives 
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5.7 Proof of Lemma 12.31 

In the screening stage, suppose we pick the threshold t{D,F) = 2qlog{p) in a way such 
that there is a constant qQ{-d,r) > such that 

q = q{D,F)>qo{i3,r,K)>0, 

Recall that G* denotes the GOSD. Let U* be the set of retained indices. Viewing it as a 
subgraph of Q*, U* decomposes into many components 

Recall that Y = X'Y. The following lemma is proved below. 

Lemma 5.1 There is a constant ci = ci{-d, r, n, 7, j4) > such that with probability at least 
1 — o{l/p), for any componentlo <U*, H^""^"!!^ > 2ci|Xo| log(p). 

The remaining part of the proof is similar to that of Lemma 2.3] so we omit it. We 
note that however Lemma |5.1| is new and needs a much harder proof. □ 

5.7.1 Proof of Lemma 15.11 

First, we need some notations. Let Xq be a component of U*, and let Xq , 1 < i < A^Qj 
be all connected subgraphs with size < m, listed lexicographically, where A'^o is an integer 
that may depend on (X,y). For each 1 < i < iVo, let xj*^ = U F^^ be the exactly 
the same partition when we screen Xq*^ in the m-stage x'^-screening of the GS-step. In out 

list, we only keep Xq*'' such that D^"^^ nXo / 0. Since Xq is a component olU* and Xq*^ is 
a connected subgraph, it follows from the way that the x^-screening is designed and the 
definition of Z)^*) that 

X'i^ C Xo, and ^(*) = Z'i^ \ {u'~},4'^), 1 < i < iVo- (5.75) 

and 

Xo = U £'(2) _ , , y ]j{No) ig a partition, (5.76) 

where F^^'^ is empty. 

Now, for each I < i < Nq, recall that as long as G o '^0 is non-singular, the x test 
score in Graphlet Screening is T{Y,b^^^,F^^) = T{Y,b'^^,F^^;l'^\x, p,n): 

T{Y,b^^,F^''>) = (y4'')'(G'4^4'')-ly4'' _ 

By basic algebra and direct calculations, it can be verified that 

T{Y,b^'\F^'^) = \\Wif, 

where Wi = W(Y , b^''\ F^'^^;Iq\ X,p,n) is defined as Wi = V'^^'^yi, and for short. 

At the same time, for a constant (5 > to be determined, define Q by 

n{i,j) = GizJ)-l{\G{i,j)\>6}. 
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The definition of Q is the same as that of il.* , except for that the threshold 5 would be 
selected differently. We introduce a counterpart of Wi which we call W* , 

Wt = V~^l\\. (5.77) 

where 

Let = {{Wl)\ {Wl)', {W^fjy, and define \lo\ x \lo\ matrices Hi and H2 as follows: 

— 1 /2 ^ X X 

Hi is a diagonal block-wise matrix where the i-th block is , and H2 = H2°' where 

H2 is a p X p matrix such that for every component Iq ofU*, and -D^*) and F^^^ defined on 
each component, 

and that the coordinates of H2 are zero elsewhere. Here Ik stands for kx k identity matrix. 
From the definitions, it is seen that 

W* = HiH2Y^'\ (5.78) 

Compared to Wi, W* is relatively easier to study, for it induces column-sparsity of H2- 
In fact, using [25l Lemma 2.2, 3.1], there is an event Ap that depends on {X,/3) such that 
-P(^p) ^ o(l/j»^) and that over the event, for all subset B with size < m, 

The following lemma is proved below. 

Lemma 5.2 Fix 6 > 0. Over the event Ap, there is a constant C > such that each row 
and column of H2 has no more than C nonzero coordinates. 



We are now ready to show Lemma 5.1, To begin with, note that since we accept 



when we graphlet-screen Iq^ , 

\\Wif >2qo\D'^'^\log{p). (5.80) 

II —1/2 II II 

At the same time, by basic algebra, — W*\\ < \\Vi ||||yj — and 

— l/2ii 

First, since > C, it is seen that over the event Ap, \\V^ || < C'. Second, by similar 
reasons, it is not hard to see that except for probability o(p~^), (C'-P'*'^^*'')-! _ 

(^)DW,FW((j^)FW,FW)-i||^ < C6^-\ and ||y^'^'|| < Cyiog(p) < Crp. Combining these 
gives 

\\Wi-W*\\ <C6^-^Tp, (5.81) 
Inserting this to (5.80), if we choose 5 to be a sufficiently small constant, 

\\W*f >^\\W,f > qo\D^'^log{p). 

— 1/2 1 1 

At the same time, by definitions, it follows from \\V^ \\ 1^ C that \\Hi\\ < C. Also, 
since over the event Ap, each coordinate of H2 is bounded from above by a constant in 
magnitude, it follows from Lemma 5.2 that \\H2\\ < C. Combining this with (5.76)-(5.78), 
it follows from basic algebra that except for probability o(p~^), 

golXoIMp) < \\W*f < \\HiH2Y^''f < C\\Y^"f, 

and the claim follows. □ 
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5.7.2 Proof of Lemma 15.21 



By definitions, it is equivalent to show tliat over the event Ap, each row and column of H2 
has finite nonzero coordinates. It is seen that each row of H2 has < m nonzeros, so all we 
need to show that each column of H2 has finite nonzeros. 

Towards this end, we introduce a new graph Q = (y,E), where V = {1,2, . . . ,p} and 
nodes i and j are connected if and only if 7^ 0. This definition is the same as 

GOSD, except that $7* is substituted by fi. It is seen that over the event Ap, for any 
^ G A^*(7, Co, 5, ^), G is -R'-sparse with K < C6~^^'^. The key for the proof is to show 
that for any k ^ i such that H2{k, i) 7^ 0, there is a path with length < (m — 1) m G that 
connects k and £. 

To see the point, we note that when H2{k, £) 7^ 0, there must be an i such that k E D^^^ 
and £ E F^^\ We claim that there is a path in Iq^ (which is regarded as a subgraph of Q) 
that connects k and £. In fact, if k and £ are not connected in 'Iq \ we can partition Iq^ 
into two separate sets of nodes such that one contains k and the other contains £, and two 
sets are disconnected. In effect, both the matrix j^-^'"''-^'''' and f]^''^^'''' can be visualized 
as two by two blockwise matrix, with off-diagonal blocks being 0. As a result, it is seen that 
H2{k,£) = 0. This contradiction shows that whenever H2{k,£) ^ 0, k and £ are connected 
by a path in Iq^. Since < m, there is a path < m — 1 in ^ that connects k and £ 

where k ^ £. 

Finally, since G is -fC-sparse with K = C6^^^"' , for any fixed £, there are at most finite 
k connecting to ^ by a path with length < (m — 1). The claim follows. □ 



5.8 Proof of Theorem 11.41 



Since a is known, for simplicity, we assume a = 1. First, consider (1.33). By Theorem 1.2 



and (1.23), pgs = ^'^^{{D,F):DnF=(!),Dy^(!),DuFc{i,2}} p{D, F; O), where we have used that G is 
a diagonal block- wise matrix, each block is the same 2x2 matrix. To calculate p{D, F; i}), 
we consider three cases (a) {\D\, \F\) = (2,0), (b) {\D\,\F\) = (1,0), (c) {\D\,\F\) = (1,1). 
By definitions and direct calculations, it is seen that p{D,F;Q) = ■!? -|- [(1 — |/io|)r]/2 in 
case (a), />(£>, F;0) = (?? + r)V(4r) in case (b), and p{D,F-^) = 2^+ [{^{l-hl)r - 
i9/y^(1 — /iQ)r)4.]^/4 in case (c). Combining these gives the claim. 



Next, consider (1.34). Similarly, by the block-wise structure of G, we can restrict our 
attention to the first two coordinates of /3, and apply the subset selection to the size 2 
subproblem where the Gram matrix is the 2x2 matrix with 1 on the diagonals and /iq on 
the off-diagonals. Fix (? > 0, and let the tuning parameter Xss = \/2qss log(p). Define 



fil\q) = flassoAq) = ^+[{V^- V^h?, f^Tiq) =2^+ [(^r(l - h^) - ^)^]^ 
and 

f^fiq) = 2^ + 2[{Vr{l-\ho\) - ^)+]^ 

where x+ = max{x,0}. The following lemma is proved below, where the key is to use 
Lemma 4.3]. 

Lemma 5.3 Fix q > 0, and apply the subset selection to the aforementioned size 2 subprob- 
lem with Xss = Y^2g'log(p). As p — )• 00, the worst-case Hamming error rate is Lpp~f'"'^''\ 
where /,,(g) = fss{q,^,r, ho) = min{i9 + (1 - \ho\)r/2,q, f^l\q), 0{q), 0{q)}. 
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By direct calculations, pss{'d,r, ho) = max|g>o} fss{'^,T, ho) and the claim follows. 

Last, consider (1.35). The proof is very similar to that of the subset selection, except 
for that we need to use |25l Lemma 4.1], instead of j25| Lemma 4.3]. For this reason, we 
omit the proof. □ 



5.8.1 Proof of Lemma [5731 



By the symmetry in ( 1.30 )-( 1.31 ) when G is given by ( 1.32[ ), we only need to consider that 
case where ho £ [0,1) and /3i > 0. Introduce events, = = 1^2 = 0}, Ai = {f3i > 

Tp,p2 = 0}, A21 = {/3l > Tp,/32 > Tp},A22 = {Pi > Tp, < -Tp}, Bq = {Pi = h= 0}, 

Bi = {/3i > 0, = 0}, B21 = > 0, /32 > 0} and B22 = {h > 0, /32 < 0}. It is seen that 
the Hamming error 

= Lp{I + II + III), (5.82) 

where I = P{Ao n Sg), // = P{Ai n Bf) and III = P(^2i n B^^) + P{A22 n B^^). 

Let H be the 2x2 matrix with ones on the diagonals and ho on the off-diagonals, a = 
/32)', and w = (11,12), where we recah Y = X'Y. It is seen that w ~ N{Ha, H). Write 
for short A = \/2q log(p). Define regions on the plane of (11,12)1 Do = {max(|Yi|, II2I) > 
A or + y| - 2hoYiY2 > 2X^(1 - h^}, Di = {\Yi\ < \ , Yi < % or \Y2 - hoYi\ > 
Ayrtg}, D21 = {Y2 - hoYi < Ayr^ or Yi - hoY2 < Ayr^} and D22 = {Y2 - 
hoYi > -Xy/1 - hi or Yi - hoY2 > A^l - or Y^ + Y:^ - 2hoYiY2 < 2X'^{1 - hi)}. Using 
m Lemma 4.3], we have B^ = {(l^i, 1^2)' € Do}, B{ = {(11, 1^2)' e Di}, 5|i) = {(^i, 1^2)' G 
D21}) and -B22 = {(^1,^2)' G ^22}- By direct calculation and Mills' ratio, it follows that 
for all /i G 0p(rp), 



I = Lp- {P{NiO, 1) > A) + P(xi > 2A2)) = Lp ■ p~\ 

II<Lp- P[N{{Tp,hoTp)\H) G Di) = .p-'?-mm[(v/?-Vg)^(l-/io)r/2,g]^ 

and when j3i 



(5.83) 
(5.84) 



Tp and j32 



0, the equality holds in (5.84). At the same time, note 
that over the event j42i, the worst case scenario, is where Pi = ^2 = 'Tp- In such a case, 
(11,12)' ~ N{{{1 + ho)Tp, (1 + ho)Tp)', H). Combining this with Mills' ratio, it follows that 
for all /i G 0p(rp), 



P{A2i n Bl{) = P{{Yi,Y2)' G D21) < Lp 



■P 



-2,?-(Vr(l-/.2)_^)2 



(5.85) 



and the equality holds when Pi = P2 = Tp. Similarly, note that over the event ^422, in 



the worst case scenario. Pi = —P2 = Tp. In such a case, (11,1^2)' ~ -^(((1 ~ ho)Tp, 
ho)Tp)\H). Combining this with Mills' ratio, it follows that for all /i G 0p(Tp), 



[I 



P{A22 n Bl^) = P{{Yi,Y2)' G D22) <Lp-p 



-2tf-min{[(Vr•(l-^',)-^/g) + ]^2{[Vr{l-/^o)-V9] + }2)^ 

(5.86) 



and the equality holds when Pi 
claim. 



P2 = Tp. Inserting ( 5.83 )-( 5.86) into (5.82) gives the 

□ 
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